Abstract

Monte Carlo Tree Search (MCTS) has recently been successfully used to create strategies
for playing imperfect-information games. Despite its popularity, there are no theoretic
results that guarantee its convergence to a well-defined solution, such as Nash equilibrium,
in these games. We partially fill this gap by analysing MCTS in the class of zero-sum
extensive-form games with simultaneous moves but otherwise perfect information. Even the
lack of information about the opponent's concurrent moves means that optimal
strategies may require randomization. We present a theoretic as well as empirical
investigation of the speed and quality of convergence of these algorithms to the Nash equilibria.
Primarily, we show that after minor technical modifications, MCTS based on any
(approximately) Hannan consistent selection function always converges to an (approximate)
subgame perfect Nash equilibrium. Without these modifications, Hannan consistency is
not sufficient to ensure such convergence and the selection function must satisfy additional
properties, which empirically hold for the most common Hannan consistent algorithms.

Keywords: Nash Equilibrium, Extensive Form Games, Simultaneous Moves, Zero Sum,
Hannan Consistency


1. Introduction 


Monte Carlo tree search (MCTS) is a very popular algorithm which recently caused a
significant jump in performance of the state-of-the-art solvers for many perfect information
problems, such as the game of Go (Gelly and Silver, 2011), or domain-independent planning
under uncertainty (Keller and Eyerich, 2012). The main idea of Monte Carlo tree search
is running a large number of randomized simulations of the problem and learning the best
actions to choose based on this data. It generally uses the earlier simulations to create
statistics that help guide the later simulations to more important parts of the search
space. After the success in domains with perfect information, subsequent research applied
the principles of MCTS also to games with imperfect information, such as an imperfect
information variant of Chess (Ciancarini and Favini, 2010), or imperfect information board
games (Powley et al., 2014; Nijssen and Winands, 2012). The same type of algorithms
can also be applied to real-world domains, such as robotics (Lisy et al., 2012a) or network
security (Lisy et al., 2012b).

While all these applications show that MCTS is a promising technique also for playing
imperfect information games, very little research has been devoted to understanding the
fundamental principles behind the success of these methods in practice. In this paper,
we aim to partially fill this gap. We focus on the simplest class of imperfect information
games, which are games with simultaneous moves, but otherwise perfect information. MCTS
algorithms have been successfully applied to many games from this class, including card
games (Teytaud and Flory, 2011; Lanctot et al., 2014), variants of simple computer games
(Perick et al., 2012), or in the most successful agents for General Game Playing (Finnsson
and Bjornsson, 2008). This class of games is simpler than the generic imperfect information
games, but it already includes one of the most fundamental complications caused by the
imperfect information, which is the need for randomized (mixed) strategies. This can be
demonstrated on the well-known game of Rock-Paper-Scissors. Any deterministic strategy
for playing the game can be easily exploited by the opponent and the optimal strategy is
to randomize uniformly over all actions.


Game theory provides fundamental concepts and results that describe the optimal behaviour
in games. In zero-sum simultaneous-move games, the optimal strategy is a subgame
perfect Nash equilibrium. For each possible situation in the game, it prescribes a strategy, 
which is optimal in several aspects. It is a strategy that gains the highest expected reward 
against its worst opponent, even if the opponent knows the strategy played by the player in 
advance. Moreover, in the zero-sum setting, even if the opponent does not play rationally, 
the strategy still guarantees at least the reward it would gain against a rational opponent. 


While computing a Nash equilibrium in a zero-sum game is a polynomially solvable
problem (Koller and Megiddo, 1992), the games where MCTS is commonly applied are too
large to allow even representing the Nash equilibrium strategy explicitly, which is generally
required by exact algorithms for computing NE. Therefore, we cannot hope that MCTS
will compute the equilibrium strategy for the complete game in the given time and space,
but we still argue that eventual convergence to the Nash equilibrium, or some other well
understood game theoretic concept, is a desirable property of MCTS algorithms in this
class of games. First, an algorithm which converges to NE is more suitable in the anytime
setting, where MCTS algorithms are most commonly used: the more time it has available
for the computation, the closer it will be to the optimal solution. This does not always
hold for MCTS algorithms in this class of games, which can stabilize at a fixed distance
from an equilibrium (Shafiei et al., 2009; Ponsen et al., 2011) or even start diverging at
some point (Lanctot et al., 2014). Second, if the game is close to its end, it may already
be small enough for an algorithm with guaranteed convergence to converge to almost exact
NE and play optimally. Non-convergent MCTS algorithms can exhibit various pathologies
in these situations. Third, understanding the fundamental game theoretic properties of the
strategies the algorithm converges to can lead to developing better variants of MCTS for
this class of games.


1.1 Contributions 

We focus on two-player zero-sum extensive form games with simultaneous moves but otherwise
perfect information. We denote the standard MCTS algorithm applied in this setting
as SM-MCTS. We present a modified SM-MCTS algorithm (SM-MCTS-A), which updates
the selection functions by averages of the sampled values, rather than the current values
themselves. We show that SM-MCTS-A combined with any (approximately) Hannan consistent
(HC) selection function with guaranteed exploration converges to an (approximate)
subgame-perfect Nash equilibrium in this class of games. We present bounds relating the
convergence rate and the eventual distance from the Nash equilibrium to the
main parameters of the games and the selection functions. We then highlight the fact that
without the "-A" modification, Hannan consistency of the selection function is not sufficient
for a similar result. We present a Hannan consistent selection function that causes the
standard SM-MCTS algorithm to converge to a solution far from the equilibrium. However,
additional requirements on the selection function used in SM-MCTS can guarantee the
convergence. As an example, we define the property of having unbiased payoff observations
(UPO), and show that it is a sufficient condition for convergence of SM-MCTS. We then
empirically confirm that the two commonly used Hannan consistent algorithms, Exp3 and
regret matching, satisfy this property, thus justifying their use in practice. We further show
that the empirical speed of convergence as well as the eventual distance from the equilibrium
is typically much better than the guarantees given by the presented theory. We empirically
show that SM-MCTS generally converges to the same equilibrium as SM-MCTS-A, but
does so slightly faster.

We also give theoretical grounds for some practical improvements, which are often used 
with SM-MCTS, but have not been formally justified. These include removal of exploration 
samples from the resulting strategy and the use of average strategy instead of empirical 
frequencies of action choices. All presented theoretic results trivially apply also to perfect 
information games with sequential moves. 


1.2 Article outline 

In Section 2 we describe simultaneous-move games, the standard SM-MCTS algorithm and
its modification SM-MCTS-A. We follow with the multi-armed bandit problem and show
how it applies in our setting. Lastly we recall the definition of Hannan consistency and
explain Exp3 and regret matching, two of the common Hannan consistent bandit algorithms.
In Section 3 we present the main theoretical results. First, we consider the modified
SM-MCTS-A algorithm and present the asymptotic and finite time bounds on its convergence
rate. We follow by defining the unbiased payoff observations property and proving the
convergence of SM-MCTS based on HC selection functions with this property. We then
provide a counterexample showing that for general Hannan consistent algorithms,
SM-MCTS does not necessarily converge and thus the result about SM-MCTS-A from Section 3
is optimal in the sense that it does not hold for SM-MCTS. We also present an example
which gives a lower bound on the quality of a strategy to which SM-MCTS(-A) converges.
Next, we discuss the notion of exploitability, which measures the quality of a strategy,
and we make a few remarks about which strategy should be considered as the output of
SM-MCTS(-A). We then present an empirical investigation of convergence of SM-MCTS
and SM-MCTS-A, as well as empirical confirmation of the fact that the commonly used
HC algorithms guarantee the UPO property. Finally, the last section summarizes the results
and highlights open questions, which might be interesting for future research.


2. Background

We now introduce the game theory fundamentals and notation used throughout the paper. 
We define simultaneous move games, describe the SM-MCTS algorithm and its modification 
SM-MCTS-A, and afterwards, we discuss existing selection functions and their properties. 

2.1 Simultaneous move games 

A finite two-player zero-sum game with perfect information and simultaneous moves can be
described by a tuple $(\mathcal{N}, \mathcal{H}, \mathcal{C}, \mathcal{Z}, \mathcal{A}, \mathcal{T}, \Delta_c, u_i, h_0)$,
where $\mathcal{N} = \{1, 2\}$ contains player labels,
$\mathcal{H}$ is a set of inner states, $\mathcal{C}$ is the set of chance states and $\mathcal{Z}$ denotes the terminal states.
$\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$ is the set of joint actions of individual players and we denote by
$\mathcal{A}_1(h) = \{1, \dots, m^h\}$ and $\mathcal{A}_2(h) = \{1, \dots, n^h\}$ the actions available to individual
players in state $h \in \mathcal{H}$. The game begins in an initial state $h_0$. The transition function
$\mathcal{T} : \mathcal{H} \times \mathcal{A}_1 \times \mathcal{A}_2 \mapsto \mathcal{H} \cup \mathcal{C} \cup \mathcal{Z}$ defines the
successor state given a current state and actions for both players. For brevity, we sometimes
denote $\mathcal{T}(h, i, j) = h_{ij}$. The chance strategy $\Delta_c : \mathcal{C} \mapsto \mathcal{H} \times [0, 1]$ determines the next states
in chance nodes based on a fixed commonly known probability distribution. The utility
function $u_1 : \mathcal{Z} \mapsto [u_{\min}, u_{\max}] \subseteq \mathbb{R}$ gives the utility of player 1, with $u_{\min}$ and $u_{\max}$ denoting
the minimum and maximum possible utility respectively. Without loss of generality we
assume $u_{\min} = 0$ and $u_{\max} = 1$. We assume zero-sum games: $\forall z \in \mathcal{Z},\ u_2(z) = -u_1(z)$.

A matrix game is a single-stage simultaneous move game with action sets $\mathcal{A}_1$ and $\mathcal{A}_2$.
Each entry in the matrix $M = (a_{ij})$, where $(i, j) \in \mathcal{A}_1 \times \mathcal{A}_2$ and $a_{ij} \in [0, 1]$, corresponds
to a payoff (to player 1) if row $i$ is chosen by player 1 and column $j$ by player 2. A
strategy $\sigma_1 \in \Delta(\mathcal{A}_1)$ is a distribution over the actions in $\mathcal{A}_1$. If $\sigma_1$ is represented as a row
vector and $\sigma_2$ as a column vector, then the expected value to player 1 when both players
play with these strategies is $u_1(\sigma_1, \sigma_2) = \sigma_1 M \sigma_2$. Given a profile $\sigma = (\sigma_1, \sigma_2)$, define
the utilities against best response strategies to be $u_1(br, \sigma_2) = \max_{\sigma_1'} \sigma_1' M \sigma_2$ and
$u_1(\sigma_1, br) = \min_{\sigma_2'} \sigma_1 M \sigma_2'$. A strategy profile $(\sigma_1, \sigma_2)$ is an $\epsilon$-Nash equilibrium of the
matrix game $M$ if and only if

$$u_1(br, \sigma_2) - u_1(\sigma_1, \sigma_2) \le \epsilon \quad \text{and} \quad u_1(\sigma_1, \sigma_2) - u_1(\sigma_1, br) \le \epsilon. \tag{1}$$
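As an illustration of condition (1), the following is a minimal sketch in Python that checks whether a
strategy profile is an $\epsilon$-Nash equilibrium of a matrix game. The function names, the numerical
tolerance, and the Rock-Paper-Scissors payoff matrix (rescaled to $[0, 1]$ with 0.5 for a draw) are
assumptions made for this example only. The inequalities are taken as non-strict so that $\epsilon = 0$
corresponds to an exact equilibrium.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]

# Hypothetical example: Rock-Paper-Scissors payoffs to player 1, rescaled to [0, 1].
# Rows and columns are ordered Rock, Paper, Scissors.
RPS: Matrix = [[0.5, 0.0, 1.0],
               [1.0, 0.5, 0.0],
               [0.0, 1.0, 0.5]]


def expected_value(M: Matrix, s1: Sequence[float], s2: Sequence[float]) -> float:
    """u_1(s1, s2) = s1 M s2 for a row strategy s1 and a column strategy s2."""
    return sum(s1[i] * M[i][j] * s2[j]
               for i in range(len(M)) for j in range(len(M[0])))


def best_response_values(M: Matrix, s1: Sequence[float],
                         s2: Sequence[float]) -> Tuple[float, float]:
    """Return (u_1(br, s2), u_1(s1, br)).

    The payoff is linear in each player's strategy, so a best response
    can be found among the pure strategies (single rows or columns).
    """
    rows, cols = range(len(M)), range(len(M[0]))
    br_vs_s2 = max(sum(M[i][j] * s2[j] for j in cols) for i in rows)
    s1_vs_br = min(sum(s1[i] * M[i][j] for i in rows) for j in cols)
    return br_vs_s2, s1_vs_br


def is_epsilon_nash(M: Matrix, s1: Sequence[float], s2: Sequence[float],
                    eps: float, tol: float = 1e-9) -> bool:
    """Check both inequalities of condition (1), up to a floating-point tolerance."""
    u = expected_value(M, s1, s2)
    br_vs_s2, s1_vs_br = best_response_values(M, s1, s2)
    return (br_vs_s2 - u <= eps + tol) and (u - s1_vs_br <= eps + tol)
```

With this matrix, the uniform profile passes the check for $\epsilon = 0$, while always playing Rock
against a uniform opponent fails it for any $\epsilon < 0.5$, because the opponent's best response
(Paper) reduces the payoff of player 1 from 0.5 to 0.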

Two-player perfect information games with simultaneous moves are sometimes appropriately
called stacked matrix games because at every state $h$ there is a joint action set
$\mathcal{A}_1(h) \times \mathcal{A}_2(h)$ that either leads to a terminal state or to a subgame which is itself
another stacked matrix game with a unique value, which can be determined by backward
induction (see Figure 1).

A behavioral strategy for player $i$ is a mapping from states $h \in \mathcal{H}$ to a probability
distribution over the actions $\mathcal{A}_i(h)$, denoted $\sigma_i(h)$. Given a profile $\sigma = (\sigma_1, \sigma_2)$, define the
probability of reaching a terminal state $z$ under $\sigma$ as $\pi^\sigma(z) = \pi_1^\sigma(z) \pi_2^\sigma(z) \pi_c(z)$, where each
$\pi_i^\sigma(z)$ (resp. $\pi_c(z)$) is a product of probabilities of the actions taken by player $i$ (the chance)
along the path to $z$. Define $\Sigma_i$ to be the set of behavioral strategies for player $i$. Then for
any strategy profile $\sigma = (\sigma_1, \sigma_2) \in \Sigma_1 \times \Sigma_2$ we define the expected utility of the strategy
profile (for player 1) as

$$u(\sigma) = u(\sigma_1, \sigma_2) = \sum_{z \in \mathcal{Z}} \pi^\sigma(z) u_1(z). \tag{2}$$

Figure 1: Example game tree of a game with perfect information and simultaneous moves.
Only the leaves contain actual rewards - the values in the inner nodes are achieved by
optimal play in the corresponding subtree, they are not part of the definition of the game.


An $\epsilon$-Nash equilibrium profile $(\sigma_1, \sigma_2)$ in this case is defined analogously to (1). In other
words, none of the players can improve their utility by more than $\epsilon$ by deviating unilaterally.
If $\sigma = (\sigma_1, \sigma_2)$ is an exact Nash equilibrium ($\epsilon$-NE with $\epsilon = 0$), then we denote the unique
value of the game $v^{h_0} = u(\sigma_1, \sigma_2)$. For any $h \in \mathcal{H}$, we denote by $v^h$ the value of the subgame
rooted in state $h$.


2.2 Simultaneous move Monte Carlo Tree Search 


Monte Carlo Tree Search (MCTS) is a simulation-based state space search algorithm often 
used in game trees. The main idea is to iteratively run simulations to a terminal state, 
incrementally growing a tree rooted at the current state. In the basic form of the algorithm, 
the tree is initially empty and a single leaf is added each iteration. The nodes in the tree 
represent game states. Each simulation starts by visiting nodes in the tree, selecting which 
actions to take based on the information maintained in the node, and then consequently 
transitioning to the successor node. When a node whose immediate children are not all in 
the tree is visited, we expand this node by adding a new leaf to the tree. Then we apply a 
rollout policy (for example, random action selection) from the new leaf to a terminal state 
of the game. The outcome of the simulation is then returned as a reward to the new leaf 
and all its predecessors. 

In Simultaneous Move MCTS (SM-MCTS), the main difference is that a joint action
of both players is selected and used to transition to a following state. The algorithm has
been previously applied, for example, in the game of Tron (Perick et al., 2012), Urban
Rivals (Teytaud and Flory, 2011), and in general game-playing (Finnsson and Bjornsson,
2008). However, guarantees of convergence to NE remain unknown, and Shafiei et al.
(2009) show that the most popular selection policy (UCB) does not converge, even in a
simple one-stage game. The convergence to a NE depends critically on the selection and
update policies applied, which are even more non-trivial in simultaneous-move games than
in purely sequential games. We describe variants of two popular selection algorithms in
Section 2.3.

In Figure 2 we present a generic template of MCTS algorithms for simultaneous-move
games (SM-MCTS). We then proceed to explain how specific algorithms are derived from
this template. Figure 2 describes a single iteration of SM-MCTS. $T$ represents the
incrementally built MCTS tree, in which each state is represented by one node. Every node $h$
maintains algorithm-specific statistics about the iterations that previously used this node.

SM-MCTS($h$ - current state of the game)
 1: if $h \in \mathcal{Z}$ then return $u_1(h)$
 2: if $h \in \mathcal{C}$ then
 3:     Sample $h' \sim \Delta_c(h)$
 4:     return SM-MCTS($h'$)
 5: if $h \in T$ then
 6:     $(a_1, a_2) \leftarrow$ Select($h$)
 7:     $h' \leftarrow \mathcal{T}(h, a_1, a_2)$
 8:     $x \leftarrow$ SM-MCTS($h'$)
 9:     Update($h, a_1, a_2, x$)
10:     return $x$
11: else
12:     $T \leftarrow T \cup \{h\}$
13:     $x \leftarrow$ Rollout($h$)
14:     return $x$


Figure 2: Simultaneous Move Monte Carlo Tree Search 


The template can be instantiated by specific implementations of the updates of the statistics
on line 9 and the selection based on these statistics on line 6. In the terminal states, the
algorithm returns the value of the state for the first player (line 1). In the chance nodes,
the algorithm selects one of the possible next states based on the chance distribution (line
3) and recursively calls the algorithm on this state (line 4). If the current state has a node
in the current MCTS tree $T$, the statistics in the node are used to select an action for each
player (line 6). These actions are executed (line 7) and the algorithm is called recursively on
the resulting state (line 8). The result of this call is used to update the statistics maintained
for state $h$ (line 9). If the current state is not stored in tree $T$, it is added to the tree (line
12) and its value is estimated using the rollout policy (line 13). The rollout policy is usually
uniform random action selection until the game reaches a terminal state, but it can also be
based on domain-specific knowledge. Finally, the result of the Rollout is returned to higher
levels of the tree.
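The template can be sketched in Python as follows. This is only an illustration under several
assumptions that are not part of the original algorithm description: the game interface
(`is_terminal`, `utility`, `num_actions`, `step`), the one-stage `OneShotMatrixGame`, and the
placeholder `UniformSelector` are hypothetical names introduced here, chance nodes are omitted, and
the selector ignores its rewards (a bandit algorithm such as Exp3 or regret matching would take its
place).

```python
import random


class UniformSelector:
    """Placeholder selection function: uniform choice, rewards are ignored."""

    def __init__(self, num_actions, rng):
        self.num_actions = num_actions
        self.rng = rng

    def select(self):
        return self.rng.randrange(self.num_actions)

    def update(self, action, reward):
        pass  # a real bandit algorithm would update its statistics here


class OneShotMatrixGame:
    """Hypothetical one-stage game: the root state, then terminal states (i, j)."""
    ROOT = "root"

    def __init__(self, payoffs):
        self.payoffs = payoffs

    def is_terminal(self, h):
        return h != self.ROOT

    def utility(self, h):
        i, j = h
        return self.payoffs[i][j]

    def num_actions(self, h):
        return len(self.payoffs), len(self.payoffs[0])

    def step(self, h, a1, a2):
        return (a1, a2)


def rollout(game, h, rng):
    """Uniform random play until a terminal state is reached (line 13)."""
    while not game.is_terminal(h):
        n1, n2 = game.num_actions(h)
        h = game.step(h, rng.randrange(n1), rng.randrange(n2))
    return game.utility(h)


def sm_mcts(game, tree, h, rng):
    """One iteration of the template; returns the sampled utility of player 1."""
    if game.is_terminal(h):                                  # line 1
        return game.utility(h)
    if h in tree:                                            # line 5
        sel1, sel2 = tree[h]
        a1, a2 = sel1.select(), sel2.select()                # line 6
        x = sm_mcts(game, tree, game.step(h, a1, a2), rng)   # lines 7-8
        sel1.update(a1, x)                                   # line 9, player 1 sees u_1
        sel2.update(a2, -x)                                  # line 9, player 2 sees -u_1
        return x                                             # line 10
    n1, n2 = game.num_actions(h)
    tree[h] = (UniformSelector(n1, rng), UniformSelector(n2, rng))  # line 12
    return rollout(game, h, rng)                             # lines 13-14
```

Calling `sm_mcts(game, tree, OneShotMatrixGame.ROOT, rng)` repeatedly with a shared `tree`
dictionary grows the tree by at most one node per iteration, as in the basic form of the algorithm.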


This template can be instantiated by choosing specific selection and update functions.
Different algorithms can be the bases for selection functions, but the most successful
selection functions are based on the algorithms for the multi-armed bandit problem we introduce
in Section 2.3. The action for each player in each node is selected independently, based on
these algorithms, and the updates update the statistics for player one by $u_1$ and for player
two by $u_2 = -u_1$ as if they were independent multi-armed bandit problems.


The SM-MCTS algorithm does not always converge to Nash equilibrium - to guarantee
convergence, additional assumptions on the selection functions are required. Therefore, we also
propose a variant of the algorithm, which we denote as SM-MCTS-A. Later we show that
this variant converges to NE under more reasonable assumptions on the selection function.
The difference is that for each node $h \in T$, the algorithm also stores the number $n^h$ of
iterations that visit this node and the cumulative reward $X^h$ received from the recursive
call in these iterations. Every time node $h$ is visited, it increases $n^h$ by one and adds $x$ to
$X^h$. SM-MCTS-A then differs from SM-MCTS only on line 9, where the selection functions
at $h$ are updated by $X^h / n^h$ instead of $x$.
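The difference between the two update rules can be made concrete with a short sketch. The class
and method names below are hypothetical and chosen for this illustration; the only logic taken from
the text is that the node keeps a visit count and a cumulative reward, and that the selection
functions receive the running average in SM-MCTS-A and the raw sample in SM-MCTS.

```python
class NodeStatistics:
    """Per-node counters used by SM-MCTS-A: visit count n^h and cumulative reward X^h."""

    def __init__(self):
        self.visits = 0     # n^h
        self.total = 0.0    # X^h

    def update_value(self, x, averaged):
        """Record the sample x and return the value given to the selection functions.

        averaged=True  corresponds to SM-MCTS-A (update by X^h / n^h),
        averaged=False corresponds to SM-MCTS   (update by x itself).
        """
        self.visits += 1
        self.total += x
        return self.total / self.visits if averaged else x
```

In both variants the value returned to the parent node on line 10 remains the raw sample `x`; only
the argument of the update on line 9 changes.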

We note that in our previous work (Lisy et al., 2013) we prove a result similar to the
convergence theorem for SM-MCTS-A presented here. However, the algorithm that we used earlier
is different from the SM-MCTS-A algorithm described here. In particular, SM-MCTS-A uses averaged
values for decision making in each node, but propagates backwards the non-averaged values. The
previous version also updates the selection function based on the averaged values, but then
it propagates these averaged numbers backwards, so that the next level takes averages
of averages and so on. Consequently, the new version is much closer to the non-averaged
SM-MCTS algorithm used in practice and it has faster empirical convergence.


2.3 Multi-armed bandit problem 

The multi-armed bandit (MAB) problem is one of the most basic models in online learning. In
theoretic studies, it is often the basic model for studying fundamental trade-offs between
exploration and exploitation in an unknown environment (Auer et al., 1995, 2002). In
practical applications, the algorithms developed for this model have recently been used in online
advertising (Pandey et al., 2007), generic optimization (Flaxman et al., 2005), and most
importantly for this paper in Monte Carlo tree search algorithms (Kocsis and Szepesvari,
2006; Browne et al., 2012; Gelly and Silver, 2011; Teytaud and Flory, 2011).


The multi-armed bandit problem got its name after a simple motivating example
concerning slot machines in casinos, also known as one-armed bandits. Assume you have a fixed
number of coins $n$ you want to use in a casino with $K$ slot machines. Each slot machine has
a hole where you can insert a coin and as a result, the machine will give you some (often
zero) reward. Each of the slot machines is generally different and decides on the size of
the rewards using a different mechanism. The basic task is to use the $n$ coins sequentially,
one by one, to receive the largest possible cumulative reward. Intuitively, it is necessary
to sufficiently explore the quality of the machines, but not to use too many coins in the
machines that are not likely to be good. The following formal definitions use the notation
from an extensive survey of the field by Bubeck and Cesa-Bianchi (2012).


Definition 1 (Adversarial multi-armed bandit problem) Multi-armed bandit problem
is a set of actions (or arms) denoted $1, \dots, K$, and a set of sequences $x_i(t)$ for each action
$i$ and time step $t = 1, 2, \dots$. In each time step, an agent selects an action $i(t)$ and receives
the payoff $x_{i(t)}(t)$. In general, the agent does not learn the values $x_i(t)$ for $i \neq i(t)$.

The adversarial MAB problem is a MAB problem, in which in each time step an
adversary selects arbitrary rewards $x_i(t) \in [0, 1]$ simultaneously with the agent selecting the
action.


The algorithms for solving the MAB problem usually optimize some notion of regret. 
Intuitively, the algorithms try to minimize the difference between playing the strategy given 
by the algorithm and playing some baseline strategy, which can possibly use information 
not available to the agent. For example, the most common notion of regret is the external 
regret, which is the difference between playing according to the prescribed strategy and 
playing the fixed optimal action all the time. 


Kovari'k and Lisy 


Definition 2 (External Regret) The external regret for playing a sequence of actions
$i(1), \dots, i(t)$ is defined as

$$R(t) := \max_{i \in \{1, \dots, K\}} \sum_{s=1}^{t} x_i(s) - \sum_{s=1}^{t} x_{i(s)}(s).$$

By $r(t)$ we denote the average external regret $r(t) := \frac{1}{t} R(t)$.
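The definition can be checked numerically on a small example. The sketch below assumes, for
illustration only, that the full reward table is available as a list of rows, one per time step;
the agent itself never observes this table, so the functions are an analysis tool rather than part
of any bandit algorithm. The function and argument names are hypothetical.

```python
def external_regret(rewards, chosen):
    """External regret R(t) after t = len(chosen) steps.

    rewards[s][i] is the reward of action i at step s (0-based),
    chosen[s] is the action actually played at step s.
    """
    t = len(chosen)
    num_actions = len(rewards[0])
    # Cumulative reward of the best fixed action in hindsight.
    best_fixed = max(sum(rewards[s][i] for s in range(t))
                     for i in range(num_actions))
    # Cumulative reward actually collected by the agent.
    collected = sum(rewards[s][chosen[s]] for s in range(t))
    return best_fixed - collected


def average_external_regret(rewards, chosen):
    """Average external regret r(t) = R(t) / t."""
    return external_regret(rewards, chosen) / len(chosen)
```

For instance, if action 0 pays 1 in three of four steps and the agent always picks the other action
exactly when it pays nothing, the best fixed action collects 3, the agent collects 0, and the
average regret is 0.75.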

2.3.1 Application to SM-MCTS(-A) 

We now explain how the MAB problem applies in the setting of SM-MCTS(-A). We focus on
the situation for player 1. For a fixed node $h \in \mathcal{H}$, our goal is to define the MAB reward
assignment $x_i(t)$ for $i \in \mathcal{A}_1(h)$, $t \in \mathbb{N}$, as they are perceived by the selection function.

Firstly we introduce two auxiliary symbols $u^h(T)$ and $T^h(t)$: Let $T$ be an iteration
during which the node $h$ got visited. By $u^h(T)$ we denote the value (from line 9 of the
algorithm in Figure 2) by which the selection function was updated during iteration $T$
(this value is either $x$ for SM-MCTS, or $X^h / n^h$ for SM-MCTS-A). We also set $T^h(t)$ to be
the iteration during which node $h$ was visited for the $t$-th time.

By $i(t) \in \mathcal{A}_1(h)$ and $j(t) \in \mathcal{A}_2(h)$ we denote the actions which were selected in $h$ during
iteration $T^h(t)$. We can now define the desired MAB reward assignment. By the definition
of the SM-MCTS(-A) algorithm (line 9), the reward $x_{i(t)}(t)$ has to be equal to the $t$-th observed
value $u^h(T^h(t))$, thus it remains to define the rewards $x_i(t)$ for $i \neq i(t)$. Intuitively, the
rewards for these actions should be "the values we would have seen if we chose differently".
Formally we set

$$x_i(t) := u^h(T^h(t')),$$

where $T^h(t')$ is the earliest iteration during
which $h$ got visited, such that $t' \ge t$, $i(t') = i$ and $j(t') = j(t)$.

We can see that for $i = i(t)$, we have $t' = t$ and therefore the definition coincides with the
one we promised earlier.

Technical remark: Strictly speaking, it is not immediately obvious that $(x_i(t))$, as
defined above, is a MAB reward assignment - in the MAB problem, the rewards $x_i(t)$, $i \in \mathcal{A}_1(h)$,
have to be defined before the $t$-th action is chosen. Luckily, this is not a problem in our
case - in theory we could compute SM-MCTS($h'$) for all possible child nodes $h'$ in advance
(before line 6), and keep each of them until they are selected. The overall behavior of
SM-MCTS(-A) would remain the same (except that it would run much slower) and the rewards
$(x_i(t))$ would correspond to a MAB problem.

In the remainder of this section, we introduce the technical notation used throughout
the paper. First, we define the notions of cumulative payoff $G$ and maximum cumulative
payoff $G_{\max}$ and relate these quantities to the external regret:

$$G(t) := \sum_{s=1}^{t} x_{i(s)}(s),$$

$$G_{\max}(t) := \max_{i \in \mathcal{A}_1(h)} \sum_{s=1}^{t} x_i(s),$$

$$R(t) = G_{\max}(t) - G(t).$$

We also define the corresponding average notions and relate them to the average regret:
$g(t) := G(t)/t$, $g_{\max}(t) := G_{\max}(t)/t$, $r(t) = g_{\max}(t) - g(t)$. If there is a risk of confusion as
to which node we are interested in, we will add a superscript $h$ and denote these variables
as $g^h(t)$, $g^h_{\max}(t)$ and so on.

Focusing now on the given node $h$, let $i$ be an action of player 1 and $j$ an action of player
2. We denote by $t_i$, $t_j$ the number of times these actions were chosen up to the $t$-th visit of
$h$ and by $t_{ij}$ the number of times both of these actions have been chosen at once. By empirical
frequencies we mean the strategy profile $\hat\sigma^h(t) = (\hat\sigma_1^h(t), \hat\sigma_2^h(t))$ given by the formulas

$$\hat\sigma_1^h(t)(i) = t_i / t, \qquad \hat\sigma_2^h(t)(j) = t_j / t.$$

By average strategies, we mean the strategy profile $(\bar\sigma_1^h(t), \bar\sigma_2^h(t))$ given by the formulas

$$\bar\sigma_1^h(t)(i) = \sum_{s=1}^{t} \sigma_1^h(s)(i) / t, \qquad \bar\sigma_2^h(t)(j) = \sum_{s=1}^{t} \sigma_2^h(s)(j) / t,$$

where $\sigma^h(s)$ are the strategies used at $h$ at time $s$.

Lastly, by $\hat\sigma(T)$ we denote the collection $(\hat\sigma^h(t^h(T)))_{h \in \mathcal{H}}$ of empirical frequencies at all
nodes $h \in \mathcal{H}$, where $t^h(T)$ denotes, for the use of this definition, the number of visits of $h$
up to the $T$-th iteration of SM-MCTS(-A). Similarly we define the average strategy $\bar\sigma(T)$.
The following lemma says there eventually is no difference between these two strategies.

Lemma 3 As $t$ approaches infinity, the empirical frequencies and average strategies will
almost surely be equal. That is, $\limsup_{t \to \infty} \max_{i \in \mathcal{A}_1} |\hat\sigma_1(t)(i) - \bar\sigma_1(t)(i)| = 0$ holds with
probability 1.

The proof is a consequence of the Strong Law of Large Numbers (and it can be found
in the appendix).
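The two notions compared in the lemma can be computed as follows. This is a minimal sketch for a
single player at a single node; the function names and the list-based representation of the history
are assumptions made for the example.

```python
def empirical_frequencies(chosen, num_actions):
    """Empirical frequencies t_i / t from the list of actions actually played."""
    t = len(chosen)
    return [chosen.count(i) / t for i in range(num_actions)]


def average_strategy(strategies):
    """Average strategy: the mean of the mixed strategies used at times 1..t.

    strategies[s][i] is the probability assigned to action i at step s.
    """
    t = len(strategies)
    num_actions = len(strategies[0])
    return [sum(s[i] for s in strategies) / t for i in range(num_actions)]
```

The empirical frequencies depend on the sampled actions, whereas the average strategy depends only
on the distributions the actions were sampled from, which is why the latter has lower variance for
a finite number of visits even though both agree in the limit.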


2.4 Hannan consistent algorithms 

A desirable goal for an algorithm in the MAB setting is the classical notion of Hannan
consistency (HC). Having this property means that for high enough $t$, the algorithm performs
nearly as well as it would if it played the optimal constant action since the beginning.

Definition 4 (Hannan consistency) An algorithm is $\epsilon$-Hannan consistent for some $\epsilon \ge 0$
if $\limsup_{t \to \infty} r(t) \le \epsilon$ holds with probability 1, where the "probability" is understood with
respect to the randomization of the algorithm. An algorithm is Hannan consistent if it is
0-Hannan consistent.


We now present regret matching and Exp3, two of the $\epsilon$-Hannan consistent algorithms
previously used in the MCTS context. The proofs of Hannan consistency of variants of these
two algorithms, as well as more related results, can be found in a survey by Cesa-Bianchi
and Lugosi (2006, Section 6). The fact that the variants presented here are $\epsilon$-HC is not
explicitly stated there, but it immediately follows from the last inequality in the proof of
Theorem 6.6 in the survey.

Input: $K$ - number of actions; $\gamma$ - exploration parameter
1: $\forall i: G_i \leftarrow 0$
2: for $t \leftarrow 1, 2, \dots$ do
3:     $\forall i: p_i \leftarrow \frac{\exp(\frac{\gamma}{K} G_i)}{\sum_{j=1}^{K} \exp(\frac{\gamma}{K} G_j)}$
4:     $\forall i: p'_i \leftarrow (1 - \gamma) p_i + \frac{\gamma}{K}$
5:     Use action $I_t$ from distribution $p'$ and receive reward $r$
6:     $G_{I_t} \leftarrow G_{I_t} + \frac{r}{p'_{I_t}}$

Figure 3: Exponential-weight algorithm for Exploration and Exploitation (Exp3)
for regret minimization in adversarial bandit setting

2.4.1 Exponential-weight algorithm for Exploration and Exploitation 

The most popular algorithm for minimizing regret in the adversarial bandit setting is the
Exponential-weight algorithm for Exploration and Exploitation (Exp3) proposed by Auer
et al. (2003) and further improved by Stoltz (2005). The algorithm has many different
variants for various modifications of the setting and desired properties. We present a
formulation of the algorithm based on the original version in Figure 3.

Exp3 stores the estimates of cumulative reward of each action over all iterations, even
those in which the action was not selected. In the pseudo-code in Figure 3 we denote
this value for action $i$ by $G_i$. It is initially set to 0 on line 1. In each iteration, a
probability distribution $p$ is created proportionally to the exponential of these estimates. The
distribution is combined with a uniform distribution with probability $\gamma$ to ensure sufficient
exploration of all actions (line 4). After an action is selected and the reward is received,
the estimate for the performed action is updated using importance sampling (line 6): the
reward is weighted by one over the probability of using the action. As a result, the expected
value of the cumulative reward estimated only from the time steps where the agent selected
the action is the same as the actual cumulative reward over all the time steps.
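The description above translates into the following Python sketch. It is an illustration under
stated assumptions rather than a reference implementation: the class and method names are
hypothetical, the exponent coefficient $\gamma / K$ is taken from the standard formulation of Exp3,
and the subtraction of the largest estimate before exponentiation is a numerical safeguard that
leaves the resulting probabilities unchanged.

```python
import math
import random


class Exp3:
    """Sketch of Exp3 with uniform exploration gamma over K actions."""

    def __init__(self, num_actions, gamma, rng=None):
        self.K = num_actions
        self.gamma = gamma
        self.G = [0.0] * num_actions          # estimated cumulative rewards (line 1)
        self.rng = rng if rng is not None else random.Random()
        self.last_probs = self.probabilities()

    def probabilities(self):
        """Distribution p' of lines 3-4."""
        eta = self.gamma / self.K
        top = max(self.G)                     # shift to avoid overflow in exp()
        weights = [math.exp(eta * (g - top)) for g in self.G]
        total = sum(weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.K
                for w in weights]

    def select(self):
        """Sample an action from p' (line 5) and remember the distribution used."""
        self.last_probs = self.probabilities()
        return self.rng.choices(range(self.K), weights=self.last_probs)[0]

    def update(self, action, reward):
        """Importance-sampling update of the played action only (line 6)."""
        self.G[action] += reward / self.last_probs[action]
```

Because every action keeps probability at least $\gamma / K$, the importance weight
`reward / self.last_probs[action]` is bounded by $K / \gamma$ for rewards in $[0, 1]$.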


2.4.2 Regret matching

An alternative learning algorithm that allows minimizing regret in stochastic bandit setting
is regret matching (Hart and Mas-Colell, 2001), later generalized as polynomially weighted
average forecaster (Cesa-Bianchi and Lugosi, 2006). Regret matching (RM) corresponds to
selection of the parameter $p = 2$ in the more general formulation. It is a general procedure
originally developed for playing known general-sum matrix games in (Hart and Mas-Colell,
2000). The algorithm computes, for each action in each step, the regret for not playing
another fixed action every time the action has been played in the past. The action to be
played in the next round is selected randomly with probability proportional to the positive
portion of the regret for not playing the action. This procedure has been shown to converge
arbitrarily close to the set of correlated equilibria in general-sum games. As a result, it
converges to a Nash equilibrium in a zero-sum game. The regret matching procedure in
Hart and Mas-Colell (2000) requires the exact information about all utility values in the
game, as well as the action selected by the opponent in each step. In Hart and Mas-Colell
(2001), the authors modify the regret matching procedure and relax these requirements.
Instead of computing the exact values for the regrets, the regrets are estimated in a similar
way as the cumulative rewards in Exp3. As a result, the modified regret matching procedure
is applicable in MAB.

Input: $K$ - number of actions; $\gamma$ - the amount of exploration
 1: $\forall i: R_i \leftarrow 0$
 2: for $t \leftarrow 1, 2, \dots$ do
 3:     $\forall i: R_i^+ \leftarrow \max\{0, R_i\}$
 4:     if $\sum_{j=1}^{K} R_j^+ = 0$ then
 5:         $\forall i: p_i \leftarrow 1/K$
 6:     else
 7:         $\forall i: p_i \leftarrow (1 - \gamma) \frac{R_i^+}{\sum_{j=1}^{K} R_j^+} + \frac{\gamma}{K}$
 8:     Use action $I_t$ from distribution $p$ and receive reward $r$
 9:     $\forall i: R_i \leftarrow R_i - r$
10:     $R_{I_t} \leftarrow R_{I_t} + \frac{r}{p_{I_t}}$

Figure 4: Regret matching variant for regret minimization in adversarial bandit setting.

We present the algorithm in Figure 4. The algorithm stores the estimates of the regrets
for not playing action $i$ in all time steps in the past in variables $R_i$. On lines 3-7, it computes
the strategy for the current time step. If there is no positive regret for any action, a uniform
strategy is used (line 5). Otherwise, the strategy is chosen proportionally to the positive
part of the regrets (line 7). The uniform exploration with probability $\gamma$ is added to the
strategy as in the case of Exp3. It also ensures that the addition on line 10 is bounded.
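A Python sketch of this regret matching variant follows. As with the other examples, the class and
method names are hypothetical and the code is meant only to mirror the steps described above: the
positive parts of the regret estimates define the strategy, uniform exploration with probability
$\gamma$ is mixed in, and the regret of the played action is corrected by importance sampling.

```python
import random


class RegretMatching:
    """Sketch of bandit regret matching with uniform exploration gamma."""

    def __init__(self, num_actions, gamma, rng=None):
        self.K = num_actions
        self.gamma = gamma
        self.R = [0.0] * num_actions          # regret estimates (line 1)
        self.rng = rng if rng is not None else random.Random()
        self.last_probs = self.probabilities()

    def probabilities(self):
        """Strategy for the current step (lines 3-7)."""
        positive = [max(0.0, r) for r in self.R]
        total = sum(positive)
        if total == 0.0:
            return [1.0 / self.K] * self.K    # no positive regret: play uniformly
        return [(1.0 - self.gamma) * r / total + self.gamma / self.K
                for r in positive]

    def select(self):
        """Sample an action (line 8) and remember the distribution used."""
        self.last_probs = self.probabilities()
        return self.rng.choices(range(self.K), weights=self.last_probs)[0]

    def update(self, action, reward):
        """Regret update (lines 9-10)."""
        p = self.last_probs[action]
        for i in range(self.K):
            self.R[i] -= reward               # every action is charged the reward
        self.R[action] += reward / p          # importance-weighted credit
```

Whenever some regret estimate is positive, the exploration term keeps each probability at least
$\gamma / K$; in the uniform case each probability is $1 / K$. Either way the division in `update`
is bounded, which is the property the text refers to for line 10.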


Cesa-Bianchi and Lugosi (2006) prove that regret matching eventually achieves zero
regret in the adversarial MAB problem, but they provide the exact finite time bound only
for the perfect-information case, where the agent learns rewards of all arms.


3. Convergence of SM-MCTS and SM-MCTS-A 


In this section, we present the main theoretic results. Apart from a few cases, we will only
present the key ideas of the proofs here, while the full proofs can be found in the appendix.
We will assume without loss of generality that the game does not contain chance nodes
(that is, $\mathcal{C} = \emptyset$); all of our results (apart from those in Section 3.2) are of an asymptotic
nature, and so they hold for general nonempty $\mathcal{C}$, since we can always use the law of large
numbers to make the impact of chance nodes negligible after a sufficiently high number of
iterations. We choose to omit the chance nodes in our analysis, since their introduction
would only require additional, purely technical, steps in the proofs, without shedding any
new light on the subject. For an overview of the notation we use, see Table 1.

In order to ensure that the SM-MCTS(-A) algorithm will eventually visit each node we 
need the selection function to satisfy the following property. 


Definition 5 We say that $A$ is an algorithm with guaranteed exploration if, for players
1 and 2 both using $A$ for action selection, $\lim_{t\to\infty} t_{ij} = \infty$ holds almost surely for each
$(i,j) \in \mathcal{A}_1 \times \mathcal{A}_2$.

Symbol                                        Meaning
$h \in \mathcal{H}$, $\mathcal{A}$, $D$       game nodes, action space, depth of the game tree
$u$, $v$, $v^h$, $d_h$                        utility, game value, subgame value, node depth
$\sigma$, $\hat\sigma$, $\bar\sigma$, br      strategy, empirical st., average st., best response
NE, HC                                        Nash equilibrium, Hannan consistent
UPO                                           Unbiased payoff observations
SM-MCTS(-A)                                   (averaged) simultaneous-move Monte Carlo tree search
MAB                                           multi-armed bandit
$i(t)$ (or also $a(t)$)                       action chosen at time $t$
$t_i$, $t_{ij}$                               uses of action $i$ (joint action $(i,j)$) up to time $t$
$x_i(t)$                                      reward assigned to an action $i$ at time $t$
$r(t)$, $R(t)$                                (average) external regret at time $t$
$G$, $g$, $G_{\max}$, $g_{\max}$              cumulative payoff (average, maximum, maximum average)
Exp3                                          Exponential-weight algorithm for Exploration and Exploitation
RM                                            regret matching algorithm
CFR                                           an algorithm for counterfactual regret minimization
$\gamma$                                      exploration rate
$C$, $c$                                      positive constants
$\eta$                                        arbitrarily small positive number
expl                                          exploitability of a strategy
$p$                                           empirical strategy with removed exploration
$I$                                           indicator function

Table 1: The most common notation for quick reference


It is an immediate consequence of this definition that when an algorithm with guaranteed
exploration is used in SM-MCTS(-A), every node of the game tree will be visited infinitely
many times. From now on, we will therefore assume that, at the start of our analysis, the full
game tree is already built. We do this because it will always happen after a finite number of
iterations and, in most cases, we are only interested in the limit behavior of SM-MCTS(-A)
(which is not affected by the events in the first finitely many steps).

Note that most of the HC algorithms, namely RM and Exp3, guarantee exploration 
without the need for any modifications. There exist some (mostly artificial) HC algorithms, 
which do not have this property. However, they can always be adjusted in the following 
way. 

Definition 6 Let $A$ be an algorithm used for choosing actions in a matrix game $M$. For a
fixed exploration parameter $\gamma \in (0,1)$ we define the modified algorithm $A^*$ as follows: For time
$s = 1, 2, \dots$, either explore with probability $\gamma$ or run one iteration of $A$ with probability $1 - \gamma$,
where “explore” means we choose the action randomly uniformly over available actions,
without updating any of the variables belonging to $A$.

Fortunately, $\epsilon$-Hannan consistency is not substantially influenced by the additional exploration:

Lemma 7 Let $A$ be an $\epsilon$-Hannan consistent algorithm. Then $A^*$ is an $(\epsilon + \gamma)$-Hannan
consistent algorithm with guaranteed exploration.
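The construction of $A^*$ can be sketched as a thin wrapper around an arbitrary selection algorithm. The interface below (an `inner` object with `select()` and `update(action, reward)` methods, and the class name `ExplorationWrapper`) is hypothetical and only serves the illustration; the one property taken from the definition is that exploration steps never touch the variables of the wrapped algorithm.

```python
import random

class ExplorationWrapper:
    """Sketch of A*: explore with probability gamma, otherwise run one iteration of A."""

    def __init__(self, inner, num_actions, gamma, seed=0):
        self.inner = inner              # hypothetical object with select() and update()
        self.num_actions = num_actions
        self.gamma = gamma
        self.rng = random.Random(seed)
        self.exploring = False

    def select(self):
        # Decide whether this time step is an exploration step.
        self.exploring = self.rng.random() < self.gamma
        if self.exploring:
            return self.rng.randrange(self.num_actions)   # uniform over actions
        return self.inner.select()

    def update(self, action, reward):
        # Exploration steps do not update any variable belonging to A.
        if not self.exploring:
            self.inner.update(action, reward)
```

Keeping the exploration outside of the wrapped algorithm is what makes the modification applicable to any $A$ without inspecting its internals.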


3.1 Asymptotic convergence of SM-MCTS-A 

The goal of this section is to prove the following Theorem 8. We will do so by backward
induction, stating firstly the required lemmas and definitions. The Theorem itself will
then follow from Corollary 12.

Theorem 8 Let $G$ be a zero-sum game with perfect information and simultaneous moves
with maximal depth $D$ and let $A$ be an $\epsilon$-Hannan consistent algorithm with guaranteed
exploration, which we use as a selection policy for SM-MCTS-A.

Then for arbitrarily small $\eta > 0$, there almost surely exists $t_0$, so that the empirical
frequencies $(\hat\sigma_1(t), \hat\sigma_2(t))$ form a subgame-perfect
$(2D(D+1)\epsilon + \eta)$-equilibrium for all $t \geq t_0$.


In other words, the average strategy will eventually get arbitrarily close to a $C\epsilon$-equilibrium.
In particular a Hannan-consistent algorithm ($\epsilon = 0$) will eventually get arbitrarily close
to a Nash equilibrium. This also illustrates why we cannot remove the number $\eta$, as even
an HC algorithm might not reach NE in finite time. In the following $\eta > 0$ will denote an
arbitrarily small number. As $\eta$ can be chosen independently of everything else, we will not
focus on the constants in front of it, writing simply $\eta$ instead of $2\eta$ etc.

It is well-known that two Hannan consistent players will eventually converge to NE in a
matrix game - see Waugh (2009) and Blum and Mansour (2007). We prove a similar result
for the approximate versions of the notions.


Lemma 9 Let $\epsilon \geq 0$ be a real number. If both players in a matrix game $M$ are $\epsilon$-Hannan
consistent, then the following inequalities hold for the empirical frequencies almost surely:

$$v - \epsilon \leq \liminf_{t\to\infty} g(t) \leq \limsup_{t\to\infty} g(t) \leq v + \epsilon, \tag{3}$$

$$v - 2\epsilon \leq \liminf_{t\to\infty} u(\hat\sigma_1(t), \mathrm{br}) \quad \& \quad \limsup_{t\to\infty} u(\mathrm{br}, \hat\sigma_2(t)) \leq v + 2\epsilon. \tag{4}$$

The inequalities (3) are a consequence of the definition of $\epsilon$-HC and the game value $v$.
The proof of inequality (4) then shows that if the value caused by the empirical frequencies
was outside of the interval infinitely many times with positive probability, it would be in
contradiction with the definition of $\epsilon$-HC. Next, we present the induction hypothesis around
which the proof of Theorem 8 revolves.
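The exact case ($\epsilon = 0$) of the matrix-game convergence result can be illustrated numerically. The sketch below is a simplification and not the setting analyzed in the paper: both players run full-information regret matching on expected payoffs (no sampling, no bandit feedback), and all function names are introduced here for illustration. The printed quantity is $u(\mathrm{br}, \bar\sigma_2) - u(\bar\sigma_1, \mathrm{br})$ for the time-averaged strategies, which equals the sum of the two average regrets.

```python
def regret_matching_selfplay(M, steps):
    """Both players run regret matching on expected payoffs of the matrix M.

    The row player maximizes and the column player minimizes.
    Returns the time-averaged strategies of both players.
    """
    n, m = len(M), len(M[0])
    R1, R2 = [0.0] * n, [0.0] * m
    avg1, avg2 = [0.0] * n, [0.0] * m

    def strategy(R):
        pos = [max(0.0, x) for x in R]
        s = sum(pos)
        return [x / s for x in pos] if s > 0 else [1.0 / len(R)] * len(R)

    for _ in range(steps):
        p, q = strategy(R1), strategy(R2)
        row_vals = [sum(M[i][j] * q[j] for j in range(m)) for i in range(n)]
        col_vals = [sum(M[i][j] * p[i] for i in range(n)) for j in range(m)]
        u = sum(p[i] * row_vals[i] for i in range(n))
        R1 = [R1[i] + row_vals[i] - u for i in range(n)]
        R2 = [R2[j] + u - col_vals[j] for j in range(m)]   # minimizer's regret
        avg1 = [a + x / steps for a, x in zip(avg1, p)]
        avg2 = [a + x / steps for a, x in zip(avg2, q)]
    return avg1, avg2


def exploitability_gap(M, p, q):
    """u(br, q) - u(p, br); zero exactly at a Nash equilibrium of a zero-sum game."""
    best_row = max(sum(M[i][j] * q[j] for j in range(len(q))) for i in range(len(p)))
    best_col = min(sum(M[i][j] * p[i] for i in range(len(p))) for j in range(len(q)))
    return best_row - best_col


M = [[1.0, 0.0], [0.0, 0.5]]
p_avg, q_avg = regret_matching_selfplay(M, 10000)
print(exploitability_gap(M, p_avg, q_avg))
```

Since the average regret of regret matching after $T$ rounds is at most $\sqrt{K/T}$ for payoffs in $[0,1]$ and $K$ actions, the printed gap should be small, which matches the statement that the averaged play approaches the equilibrium.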

Induction hypothesis $(IH_d)$: For a node $h$ in the game tree, we denote by $d_h$ the depth
of the tree rooted at $h$ (not including the terminal states - therefore when $d_h = 1$, the node
is a matrix game). Let $d \in \{1, \dots, d_{\mathrm{root}}\}$. Induction hypothesis $(IH_d)$ is then the claim
that for each node $h$ with $d_h = d$, there almost surely exists $t_0$ such that for each $t \geq t_0$

1. the payoff $g^h(t)$ will fall into the interval $(v^h - C_d\epsilon, v^h + C_d\epsilon)$;

2. the utilities $u(\hat\sigma_1(t), \mathrm{br}) \leq u(\mathrm{br}, \hat\sigma_2(t))$ with respect to the matrix game $(v^h_{ij})$ will
fall into the interval $(v^h - 2C_d\epsilon, v^h + 2C_d\epsilon)$;

where $C_d = d + \eta$ and $v^h_{ij}$ is the value of the subgame rooted at the child node of $h$ indexed by
$ij$.

Note that Lemma 9 ensures that $(IH_1)$ holds. Our goal is to prove 2. for every $h \in \mathcal{H}$,
which then implies the main result. However, for the induction itself to work, the condition
1. is required. We now introduce the necessary technical tools.

Definition 10 Let $M = (a_{ij})$ be a matrix game. For $t \in \mathbb{N}$ we define $M(t) = (a_{ij}(t))$ to
be a game, in which if players chose actions $i$ and $j$, they observe (randomized) payoffs
$a_{ij}(t, (i(1), \dots, i(t-1)), (j(1), \dots, j(t-1)))$. We will denote these simply as $a_{ij}(t)$, but in fact
they are random variables with values in $[0,1]$ and their distribution in time $t$ depends on
the previous choices of actions.

We say that $M(t) = (a_{ij}(t))$ is a repeated game with error $\epsilon$, if there almost surely
exists $t_0 \in \mathbb{N}$, such that $|a_{ij}(t) - a_{ij}| \leq \epsilon$ holds for some matrix $(a_{ij})$ and all $t \geq t_0$. By
symbols $G(t)$, $R(t)$, $r(t)$ (and so on) we will denote the payoffs, regrets and other variables
related to the distorted payoffs $a_{ij}(t)$. On the other hand, by symbol $u(\sigma)$ we will refer to
the utility of strategy $\sigma$ with respect to the matrix game $(a_{ij})$.

The intuition behind this definition is that the players are repeatedly playing the original
matrix game $M$ - but for some reason, they receive imprecise information about their payoffs.
The application we are interested in is the following: we take a node $h$ inside the game tree.
The matrix game without error is the matrix game $M = (v_{ij})$, where $v_{ij}$ are the values of
subgames nested at $h$. By $(IH_{d_h - 1})$, the payoffs received in $h$ during SM-MCTS-A can be
described as a repeated game with error, where the observed payoffs are $g^{h_{ij}}(t)$.

The following proposition is an analogy of Lemma 9 for repeated games with error. It
shows that $\epsilon$-HC algorithms will still perform well even if they observe slightly perturbed
rewards.

Proposition 11 Let $M = (v_{ij})$ be a matrix game with value $v$ and $\epsilon, c \geq 0$. If $M(t)$ is a
corresponding repeated game with error $c\epsilon$ and both players are $\epsilon$-Hannan consistent, then
the following inequalities hold almost surely:

$$v - (c+1)\epsilon \leq \liminf_{t\to\infty} g(t) \leq \limsup_{t\to\infty} g(t) \leq v + (c+1)\epsilon, \tag{5}$$

$$v - 2(c+1)\epsilon \leq \liminf_{t\to\infty} u(\hat\sigma_1, \mathrm{br}) \leq \limsup_{t\to\infty} u(\mathrm{br}, \hat\sigma_2) \leq v + 2(c+1)\epsilon. \tag{6}$$

The proof is similar to the proof of Lemma 9. It needs an additional claim that if the
algorithm is $\epsilon$-HC with respect to the observed values with errors, it still has a bounded
regret with respect to the exact values.

Corollary 12 $(IH_d) \implies (IH_{d+1})$.

Proof Property 1. of $(IH_d)$ implies that every node $h$ with $d_h \leq d$ is a repeated game with
error $d\epsilon + \eta$. Proposition 11 then implies that any $h$ with $d_h = d + 1$ is again a repeated
game with error, and by inequality (5) the value of error increases to $(d+1)\epsilon + \eta$, which
gives $(IH_{d+1})$. ■


Recall here the following well-known fact:

Remark 13 In a zero-sum game with value $v$ the following implication holds:

$$\left(u_1(\mathrm{br}, \hat\sigma_2) < v + \tfrac{\epsilon}{2} \ \text{and}\ u_1(\hat\sigma_1, \mathrm{br}) > v - \tfrac{\epsilon}{2}\right) \implies$$

$$\left(u_1(\mathrm{br}, \hat\sigma_2) - u_1(\hat\sigma_1, \hat\sigma_2) < \epsilon \ \text{and}\ u_2(\hat\sigma_1, \mathrm{br}) - u_2(\hat\sigma_1, \hat\sigma_2) < \epsilon\right) \overset{\mathrm{def}}{\iff}$$

$(\hat\sigma_1, \hat\sigma_2)$ is an $\epsilon$-equilibrium.

The following example demonstrates that the above implication would not hold if we
replaced $\epsilon/2$ by $\epsilon$. Consider the following game

    0.4   0.5
    0.6   0.5

with a strategy profile $(1,0), (1,0)$. The value of the game is $v = 0.5$, $u(\mathrm{br}, (1,0)) = 0.6$ and
$u((1,0), \mathrm{br}) = 0.4$. The best responses to the strategies of both players are 0.1 from the game
value, but $(1,0), (1,0)$ is a 0.2-NE, since player 1 can improve by 0.2.
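The numbers in this example can be checked directly. The snippet below is a small sketch in which the matrix is read row by row as the payoffs of the maximizing row player (an assumption about the layout of the lost matrix that is consistent with all values quoted in the text); the variable names are introduced here for illustration.

```python
M = [[0.4, 0.5],
     [0.6, 0.5]]   # payoffs of player 1 (row player, maximizer)

def u(p, q):
    """Expected payoff of player 1 under mixed strategies p (rows) and q (columns)."""
    return sum(p[i] * q[j] * M[i][j] for i in range(2) for j in range(2))

pure = [(1, 0), (0, 1)]
sigma1, sigma2 = (1, 0), (1, 0)

br_vs_sigma2 = max(u(a, sigma2) for a in pure)    # u(br, sigma2): best reply of player 1
br_vs_sigma1 = min(u(sigma1, b) for b in pure)    # u(sigma1, br): best reply of player 2
gain_player1 = br_vs_sigma2 - u(sigma1, sigma2)   # what player 1 gains by deviating
gain_player2 = u(sigma1, sigma2) - br_vs_sigma1   # what player 2 gains by deviating

print(br_vs_sigma2, br_vs_sigma1, gain_player1, gain_player2)
```

Both best-response values are within 0.1 of the game value 0.5, yet player 1 gains 0.2 by deviating, which is why the implication needs $\epsilon/2$ on its left-hand side.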

Proof [Proof of Theorem 8] First, we observe that by Lemma 9 $(IH_1)$ holds, and consequently
by Corollary 12 $(IH_d)$ holds for every $d = 1, \dots, D$. Denote by $u^h(\sigma)$ (resp. $u^h_{ij}(\sigma)$)
the expected payoff corresponding to the strategy $\sigma$ used in the subgame rooted at node
$h \in \mathcal{H}$ (resp. its child). Remark 13 then states that, in order to prove Theorem 8, it is
enough to show that for every $h \in \mathcal{H}$, the strategy $\hat\sigma(t)$ will eventually satisfy

$$u^h(\mathrm{br}, \hat\sigma_2(t)) \leq v^h + (d_h + 1)\, d_h \epsilon + \eta. \tag{7}$$

We will do this by backward induction. The property 1. from $(IH_1)$ implies that the
inequality (7) holds for nodes $h$ with $d_h = 1$. Let $1 < d \leq D$, $h \in \mathcal{H}$ be such that $d_h = d$
and assume, as a hypothesis for backward induction, that the inequality (7) holds for each
$h'$ with $d_{h'} < d$. We observe that

$$u^h(\mathrm{br}, \hat\sigma_2(t)) = \max_i \sum_j \hat\sigma_2(t)(j)\, u^h_{ij}(\mathrm{br}, \hat\sigma_2(t))$$

$$\leq v^h + \left(\max_i \sum_j \hat\sigma_2(t)(j)\, v^h_{ij} - v^h\right) + \max_i \sum_j \hat\sigma_2(t)(j) \left(u^h_{ij}(\mathrm{br}, \hat\sigma_2(t)) - v^h_{ij}\right).$$

By property 2. in $(IH_d)$ the first term in the brackets is at most $2d\epsilon + \eta$. By the backward
induction hypothesis we have

$$u^h_{ij}(\mathrm{br}, \hat\sigma_2(t)) - v^h_{ij} \leq d(d-1)\epsilon + \eta.$$

Therefore we have

$$u^h(\mathrm{br}, \hat\sigma_2(t)) \leq v^h + 2d\epsilon + d(d-1)\epsilon + \eta = v^h + (d+1)\, d\epsilon + \eta.$$

For $d = D$ and $h = \mathrm{root}$, Remark 13 implies that $(\hat\sigma_1(t), \hat\sigma_2(t))$ will form a $(2D(D+1)\epsilon + \eta)$-equilibrium
of the whole game. ■


3.2 SM-MCTS-A finite time bound 


In this section, we find a probabilistic finite time bound on the performance of HC algorithms 
in SM-MCTS-A. We do this by taking the propositions from Section 3.1 and working with 
their quantified versions. 


Theorem 14 (Finite time bound for SM-MCTS-A) Consider the following setting:
A game with at most $b$ actions at each node $h \in \mathcal{H}$ and depth $D$, played by SM-MCTS-A
using an $\epsilon$-Hannan consistent algorithm $A$ with exploration $\gamma$. Fix $\delta > 0$. Then with
probability at least $1 - (2|\mathcal{H}| + D)\delta$, the empirical frequencies will form a $4D(D+1)\epsilon$-equilibrium
for every $t \geq T_0$, where

/h\ 

To = f-J log (2 \H\ - 2) Ta (e, 5) 


and $T_A(\epsilon, \delta)$ is the time needed for $A$ to have with probability at least $1 - \delta$ regret below $\epsilon$
for all $t \geq T_A(\epsilon, \delta)$.

We obtain this bound by going through the proof of Theorem 8 in more detail, replacing
statements of the type “inequality of limits holds” by “for all $t \geq t_0$ a slightly worse
inequality holds with probability at least $1 - \delta$”. We also note that the actual convergence
will be faster than the one stated above, because the theorem relies on quantification of the
guaranteed exploration property (necessary for our proof), rather than the fact that MCTS
attempts to solve the exploration-exploitation problem (the major reason for its popularity
in practice).


3.3 Asymptotic convergence of SM-MCTS 

We would like to prove an analogy of Theorem 8 for SM-MCTS. Unfortunately, such a goal
is unattainable in general - in Section 5.2 we present a counterexample, showing that such a
theorem with no additional assumptions does not hold. Instead we define, for an algorithm
$A$, the property of having $\epsilon$-unbiased payoff observations ($\epsilon$-UPO, see Definition 18) and
prove the following Theorem 15 for $\epsilon$-HC algorithms with this property. We were unable to
prove that specific $\epsilon$-HC algorithms have this $\epsilon$-UPO property, but instead, later
we provide empirical evidence supporting our hypothesis that the “typical” algorithms,
such as regret matching or Exp3, indeed do have $\epsilon$-unbiased payoff observations.


Theorem 15 Let $A$ be an $\epsilon$-HC algorithm with guaranteed exploration that has $\epsilon$-UPO. If
$A$ is used as selection policy for SM-MCTS, then the average strategy of $A$ will eventually
get arbitrarily close to a $C\epsilon$-NE of the whole game, where $C = 12(2^D - 1) - 8D$.

We now present the notation required for the definition of the $\epsilon$-UPO property, and then
proceed to the proof of Theorem 15. As we will see, the structure of this proof is similar to
the structure of Section 3.1, but some of the propositions have a slightly different form.


3.3.1 Definition of the UPO property 

Notation 16 Let $h \in \mathcal{H}$ be a node. We will take a closer look at what is happening at $h$.
Let $h_{ij}$ be the children of $h$. Since the events in $h$ and above do not affect what happens in $h_{ij}$
(only the time when it happens), we denote by $s_{ij}(1), s_{ij}(2), \dots$ the sequence of payoffs
we get for sampling $h_{ij}$ for the first time, the second time and so on. The correspondence
between these numbers $s_{ij}$ and the payoffs $x_{ij}$ observed in $h$ is $x_{ij}(t) = s_{ij}((t-1)_{ij} + 1)$,
where $(t-1)_{ij}$ is the number of uses of joint action $(i,j)$ up to time $t-1$.

Note that all of these objects are, in fact, random variables and their distribution depends
on the used selection policy. By $\bar{s}_{ij}(n) = \frac{1}{n}\sum_{m=1}^{n} s_{ij}(m)$ we denote the standard
arithmetical average of $s_{ij}$. Finally, setting $t^*_{ij}(k) = \min\{t \in \mathbb{N} \mid t_{ij} = k\}$, we define the
weights $w_{ij}(k)$ and the weighted average $\tilde{s}_{ij}(k)$:

$$w_{ij}(n) = 1 + \left|\{t \in \mathbb{N} \mid t^*_{ij}(n-1) < t < t^*_{ij}(n) \ \&\ t \text{ satisfies } j(t) = j \text{ but } i(t) \neq i\}\right|,$$

$$\tilde{s}_{ij}(k) = \frac{1}{\sum_{m=1}^{k} w_{ij}(m)} \sum_{m=1}^{k} w_{ij}(m)\, s_{ij}(m).$$
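To make the definition of the weights concrete, the following sketch computes $w_{ij}(n)$ from a recorded history of actions. It is only an illustration: the function name `weights_from_history` and the list-based representation of the history are assumptions made here, and the convention $t^*_{ij}(0) = 0$ is used for the first window.

```python
def weights_from_history(actions_i, actions_j, i, j):
    """Compute the weights w_ij(1), w_ij(2), ... for the joint action (i, j).

    actions_i[t-1] and actions_j[t-1] are the actions of the two players at
    time t = 1, ..., T. The n-th weight is 1 plus the number of time steps
    strictly between the (n-1)-th and n-th use of (i, j) at which the second
    player chose j while the first player chose something other than i.
    """
    weights, pending = [], 0
    for a_i, a_j in zip(actions_i, actions_j):
        if a_j == j and a_i != i:
            pending += 1                  # j was played, but i was not
        elif a_i == i and a_j == j:
            weights.append(1 + pending)   # time t*_ij(n): close the n-th window
            pending = 0
    return weights


# One player repeating Y, X, X, Y against an opponent with a single action "-".
history_i = ["Y", "X", "X", "Y"] * 2
history_j = ["-"] * 8
print(weights_from_history(history_i, history_j, "Y", "-"))
```

For the repeated pattern Y, X, X, Y this yields the alternating weights 1, 3, 1, 3, since every second use of Y is preceded by two rounds in which X was played.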


Remark 17 (Motivation for the definition of UPO property) If our algorithm $A$ is
$\epsilon$-HC, we know that if $h_{ij}$ is a node with $d_{h_{ij}} = 1$ and value $v_{ij}$, then
$\limsup_{n\to\infty} |\bar{s}_{ij}(n) - v_{ij}| \leq \epsilon$ (Lemma 9, inequality (3), where $g(n) = \bar{s}_{ij}(n)$).
In more vague words, “we have some information about $\bar{s}_{ij}$”, therefore, we would prefer to
work with these “simple” averages. Unfortunately, the variables which naturally appear in the
context of SM-MCTS are the “complicated” averages $\tilde{s}_{ij}$ - we will see this in the proof of
Theorem 15 and it also follows from the fact that, in general, there is no relation between the
performance of SM-MCTS and the value of differences $\bar{s}_{ij}(n) - v_{ij}$ (see Section 5.2 for a
counterexample). This leads to the following definition:


Definition 18 (UPO) We say that an algorithm $A$ guarantees $\epsilon$-unbiased payoff observations,
if for every (simultaneous-move zero-sum perfect information) game $G$, every node $h$
and actions $i$, $j$, the arithmetic averages $\bar{s}_{ij}$ and weighted averages $\tilde{s}_{ij}$ almost surely satisfy

$$\limsup_{n\to\infty} |\tilde{s}_{ij}(n) - \bar{s}_{ij}(n)| \leq \epsilon.$$

We will sometimes abbreviate this by saying that “$A$ is an $\epsilon$-UPO algorithm”.


Observe that this in particular implies that if, for some $c > 0$,

$$\limsup_{n\to\infty} |\bar{s}_{ij}(n) - v_{ij}| \leq c\epsilon$$

holds almost surely, then we also have

$$\limsup_{n\to\infty} |\tilde{s}_{ij}(n) - v_{ij}| \leq (c+1)\epsilon \quad \text{a.s.}$$

Next, we present a few examples which motivate the above definition and support the
discussion that follows.

Example 1 (Examples related to the UPO property)

1. Suppose that $w_{ij}(n)$, $s_{ij}(n)$, $n \in \mathbb{N}$ do not necessarily originate from the SM-MCTS
algorithm, but assume they satisfy:

(a) $w_{ij}(n)$, $s_{ij}(n)$, $n \in \mathbb{N}$ are independent

(b) $\exists C > 0\ \forall n \in \mathbb{N} : w_{ij}(n) \in [0, C]$

(c) $\forall n \in \mathbb{N} : s_{ij}(n) \in [0,1]$ & $\mathbf{E}[|s_{ij}(n) - v_{ij}|] \leq \frac{\epsilon}{2}$ for some $v_{ij} \in [0,1]$.

Then, by the strong law of large numbers, we almost surely have

$$\limsup_{n\to\infty} |\bar{s}_{ij}(n) - \tilde{s}_{ij}(n)| \leq \epsilon. \tag{8}$$

2. The previous case can be generalized in many ways - for example it is sufficient to
replace bounded $w_{ij}(n)$ by ones whose tail probabilities are controlled, uniformly in $n$ and
$i, j$, by some $q \in (0,1)$ (an assumption which holds with $q = \gamma/|\mathcal{A}_1(h)|$ when $w_{ij}(n)$, $s_{ij}(n)$
originate from SM-MCTS with fixed exploration). Also, the variables $w_{ij}(n)$, $s_{ij}(n)$ do not have to
be fully independent - it might be enough if the correlation between each $s_{ij}(n)$ and
$w_{ij}(n)$ was “low enough for most $n \in \mathbb{N}$”.

3. Later we provide empirical evidence, which suggests that when the variables
$s_{ij}(n)$, $w_{ij}(n)$ originate from SM-MCTS with Exp3 or RM selection policy, then the
assertion (8) of 1. holds as well (and thus these two $\epsilon$-HC algorithms are $\epsilon$-UPO).

4. Assume that $(s_{ij}(n))_{n=1}^{\infty} = (1, 0, 1, 0, 1, \dots)$ and $(w_{ij}(n))_{n=1}^{\infty} = (1, 3, 1, 3, 1, \dots)$. Then
we have $\bar{s}_{ij}(n) \to \frac{1}{2}$, but $\tilde{s}_{ij}(n) \to \frac{1}{4}$.

5. In Section 5.2 we construct an example of an $\epsilon$-HC algorithm, based on 4., such that when
it is used as a selection policy in a certain game, the difference $|\bar{s}_{ij}(n) - \tilde{s}_{ij}(n)|$ does not
vanish in the limit.
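The gap between the two averages for the alternating sequences of item 4 can be verified with a few lines of Python. This is a small numerical sketch; the helper `averages` is a name introduced here.

```python
def averages(s, w):
    """Return the arithmetic average and the w-weighted average of the payoffs s."""
    s_bar = sum(s) / len(s)
    s_tilde = sum(wi * si for wi, si in zip(w, s)) / sum(w)
    return s_bar, s_tilde


n = 1000
s = [1 if m % 2 == 0 else 0 for m in range(n)]   # payoffs 1, 0, 1, 0, ...
w = [1 if m % 2 == 0 else 3 for m in range(n)]   # weights 1, 3, 1, 3, ...
s_bar, s_tilde = averages(s, w)
print(s_bar, s_tilde)
```

The payoff 0 always carries weight 3 while the payoff 1 carries weight 1, so the weighted average is $(1 \cdot 1 + 3 \cdot 0)/(1 + 3) = 1/4$, whereas the arithmetic average is $1/2$.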

The cases 2. and 3. from Example 1 suggest that it is possible to prove that specific $\epsilon$-HC
algorithms are $\epsilon$-UPO. On the other hand, 5. shows that the implication ($A$ is $\epsilon$-HC $\implies$
$A$ is $C\epsilon$-UPO) does not hold, no matter how high $C > 0$ we choose. Also, the guarantees we
have about the behavior of, for example, Exp3 are much weaker than the assumptions made
in 1. - there is no independence between $w_{ij}(n)$, $s_{ij}(m)$, $m, n \in \mathbb{N}$; at best we can use some
martingale theory. Moreover, even in nodes $h \in \mathcal{H}$ with $d_h = 1$, we have
$\limsup_{n\to\infty} |\bar{s}_{ij}(n) - v_{ij}| \leq \epsilon$, instead of assumption (c) from 1. This implies that the proof
that specific $\epsilon$-HC algorithms are $\epsilon$-UPO will not be trivial.

3.3.2 The proof of Theorem 15

The following proposition shows that if the $\epsilon$-UPO assumption holds, then having low regret in
some $h \in \mathcal{H}$ with respect to observed rewards is sufficient to bound the regret with respect
to the rewards originating from the matrix game $(v_{ij})$.


Proposition 19 Let $h \in \mathcal{H}$ and $\epsilon, c \geq 0$. Let $A$ be an $\epsilon$-HC algorithm which generates
the sequence of actions $(i(t))$ at $h$ and suppose that the adversary chooses actions $(j(t))$. If
$\limsup_{n\to\infty} |\bar{s}_{ij}(n) - v_{ij}| \leq c\epsilon$ holds a.s. for each $i, j$ and $A$ is $\epsilon$-UPO, then we almost surely
have

$$\limsup_{t\to\infty} \frac{1}{t} \left( \max_{i(0)} \sum_{s=1}^{t} v_{i(0)j(s)} - \sum_{s=1}^{t} v_{i(s)j(s)} \right) \leq 2(c+1)\epsilon. \tag{9}$$

Consequently the choice of actions $(i(t))$ made by the algorithm $A$ is $2(c+1)\epsilon$-HC with
respect to the matrix game $(v_{ij})$.

The proof of this proposition consists of rewriting the sums in inequality (9) and using the
fact that the weighted averages $\tilde{s}_{ij}$ are close to the standard averages $\bar{s}_{ij}$. Denote by $(\widetilde{IH}_d)$
the claim, which is the same as $(IH_d)$ from Section 3.1 except that it concerns the SM-MCTS
algorithm rather than SM-MCTS-A and $C_d = 3 \cdot 2^{d-1} - 2$. The results above then immediately give
the following corollary. Analogously to Section 3.1 this in turn implies the main theorem
of this section, the proof of which is similar to the proof of Theorem 8.


Corollary 20 $(\widetilde{IH}_d) \implies (\widetilde{IH}_{d+1})$.


Proof By the results above, the implication holds for some constants $C_d$. It remains to show that
$C_d = 3 \cdot 2^{d-1} - 2$. We proceed by induction - since the algorithm $A$ is $\epsilon$-HC,
we know that, by Lemma 9, $(\widetilde{IH}_1)$ holds with $C_1 = 1$. For $d \geq 1$, Proposition 19 implies
$C_{d+1} = 2(C_d + 1)$. A classical induction then gives the result.


Proof [Proof of Theorem 15] Using Corollary 20, the proof is identical to the proof of
Theorem 8 - it remains to determine the new value of $C$. As in the proof of Theorem 8 we
have $C = 2 \cdot 2 \sum_{d=1}^{D} C_d$, and we need to calculate this sum:

$$\sum_{d=1}^{D} C_d = \sum_{d=1}^{D} \left(3 \cdot 2^{d-1} - 2\right) = 3\left(1 + \dots + 2^{D-1}\right) - 2D = 3\left(2^D - 1\right) - 2D.$$
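The closed forms used in these two proofs can be double-checked mechanically. The sketch below (function names are introduced here for illustration) iterates the recurrence $C_1 = 1$, $C_{d+1} = 2(C_d + 1)$ and compares it with $3 \cdot 2^{d-1} - 2$, and then compares $4 \sum_{d=1}^{D} C_d$ with the constant $12(2^D - 1) - 8D$ from Theorem 15.

```python
def c_recursive(d):
    """C_1 = 1 and C_{d+1} = 2 * (C_d + 1)."""
    c = 1
    for _ in range(d - 1):
        c = 2 * (c + 1)
    return c


def c_closed(d):
    """Closed form 3 * 2^(d-1) - 2."""
    return 3 * 2 ** (d - 1) - 2


def theorem15_constant(D):
    """C = 2 * 2 * sum_{d=1}^{D} C_d."""
    return 4 * sum(c_recursive(d) for d in range(1, D + 1))


for D in range(1, 6):
    print(D, c_recursive(D), c_closed(D), theorem15_constant(D), 12 * (2 ** D - 1) - 8 * D)
```

For $D = 2$, for example, $C_1 + C_2 = 1 + 4 = 5$ and $4 \cdot 5 = 20 = 12 \cdot 3 - 16$.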


4. Exploitability and exploration removal 

One of the most common measures of the quality of a strategy in imperfect information
games is the notion of exploitability (for example, Johanson et al. 2011). It will be useful
for the empirical evaluation of our main result, as well as for the discussion of
lower bounds in Section 5. In this section, we first recall the definition of this notion and
we follow with a few observations concerning which strategy should be considered the output
of SM-MCTS(-A) algorithms.

Definition 21 Exploitability of strategy $\sigma_1$ of player 1 is the quantity

$$\mathrm{expl}_1(\sigma_1) := v - u(\sigma_1, \mathrm{br}),$$

where $v$ is the value of the game and br is a second player’s best response strategy to $\sigma_1$.
Analogously we define $\mathrm{expl}_2$ for the second player’s strategies.

Clearly we always have $\mathrm{expl}_i(\sigma_i) \geq 0$, $i = 1, 2$ and a strategy profile $\sigma = (\sigma_1, \sigma_2)$ is a Nash
equilibrium iff $\mathrm{expl}_1(\sigma_1) = \mathrm{expl}_2(\sigma_2) = 0$.


Remark 22 (Removing the exploration) In SM-MCTS(-A) we often use a selection
function with fixed exploration parameter $\gamma > 0$, such that the algorithm is guaranteed to
converge to a $C\gamma$-equilibrium for some constant $C > 0$ (for example Exp3 or regret matching).
Teytaud and Flory (2011) suggest removing the random noise caused by this exploration
from the resulting strategies, but they do it heuristically and do not formally analyze this
procedure. By definition of exploration, the average strategy $(\bar\sigma_1(t), \bar\sigma_2(t))$ produced by
SM-MCTS(-A) algorithms is of the form

$$\bar\sigma_i(t) = (1 - \gamma)\, p_i(t) + \gamma \cdot \mathrm{rnd}$$

for some strategy $p_i(t)$, where rnd is the strategy used when exploring, assigning to each
action the same probability.

In general, rnd will not be an equilibrium strategy of our game. This means that for
small values of $\gamma$ and high enough $t$, so that the algorithms have time to converge (that is
when $\bar\sigma_i(t)$ is reasonably good), we have

$$\mathrm{expl}_i(\mathrm{rnd}) > C\gamma > \mathrm{expl}_i(\bar\sigma_i(t)).$$

And finally since the function $\mathrm{expl}_i$ is linear, we have

$$C\gamma > \mathrm{expl}_i(\bar\sigma_i(t)) = (1 - \gamma)\, \mathrm{expl}_i(p_i(t)) + \gamma \cdot \mathrm{expl}_i(\mathrm{rnd}) > (1 - \gamma)\, \mathrm{expl}_i(p_i(t)) + \gamma \cdot C\gamma.$$

This necessarily implies that $\mathrm{expl}_i(p_i(t)) < \mathrm{expl}_i(\bar\sigma_i(t))$.


We can summarize this remark by the following proposition (the proof of which consists of
using the fact that utility is a linear function):

Proposition 23 Let $\bar\sigma(t) = (\bar\sigma_1(t), \bar\sigma_2(t))$ be the average strategy. Let $\gamma > 0$ and set
$p_i(t) := \frac{1}{1-\gamma}\left(\bar\sigma_i(t) - \gamma \cdot \mathrm{rnd}\right)$.
Then the following holds:

(1) $\mathrm{expl}_i(p_i(t)) \leq \mathrm{expl}_i(\bar\sigma_i(t)) + \gamma/(1-\gamma)$.

(2) If $\mathrm{expl}_i(\mathrm{rnd}) > C\gamma > \mathrm{expl}_i(\bar\sigma_i(t))$ holds for some $C > 0$, then the strategy $p_i(t)$ satisfies
$\mathrm{expl}_i(p_i(t)) < \mathrm{expl}_i(\bar\sigma_i(t))$.


Figure 5: A single-player game where the quality of a strategy has linear dependence on
the exploration parameter and on the game depth $D$. The numbers satisfy $0 < u_1 <
u_2 < \dots < u_D < 1$.


Less formally speaking, there are two possibilities. First is that our algorithm had so
little time to converge that it is better to disregard its output $\bar\sigma_i(t)$ and play randomly
instead. If this is not the case, then by (2) it is always better to remove the exploration
and use the strategy $p_i(t)$ instead of $\bar\sigma_i(t)$. And by (1), even if we remove the exploration,
we cannot increase the exploitability of $p_i(t)$ by more than $\gamma/(1-\gamma)$. We illustrate this by
experiments, where we compare the quality of strategies $p_i(t)$ and
$\bar\sigma_i(t)$.
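Exploration removal and the bound (1) can be illustrated on a small matrix game. The sketch below is not taken from the paper: the helper names, the choice of matching pennies with payoffs in $[0,1]$ (value $1/2$), and the particular strategy are assumptions made for the illustration, and the game value is passed in by hand rather than computed.

```python
def remove_exploration(sigma, gamma):
    """p = (sigma - gamma * rnd) / (1 - gamma), with rnd uniform over the actions."""
    k = len(sigma)
    return [(x - gamma / k) / (1 - gamma) for x in sigma]


def expl_row_player(M, v, sigma):
    """expl_1(sigma) = v - u(sigma, br) for the maximizing row player."""
    worst = min(sum(sigma[i] * M[i][j] for i in range(len(M)))
                for j in range(len(M[0])))
    return v - worst


M = [[1.0, 0.0], [0.0, 1.0]]   # matching pennies with payoffs in [0, 1]
v, gamma = 0.5, 0.2
sigma = [0.6, 0.4]             # a strategy containing exploration
p = remove_exploration(sigma, gamma)

expl_sigma = expl_row_player(M, v, sigma)
expl_p = expl_row_player(M, v, p)
print(p, expl_sigma, expl_p, expl_sigma + gamma / (1 - gamma))
```

Here the uniform strategy happens to be the equilibrium, so the premise of (2) fails and removing the exploration makes the strategy slightly more exploitable (roughly 0.125 instead of 0.1). The increase still stays within the $\gamma/(1-\gamma)$ margin allowed by (1).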


5. Counterexample and lower bounds 


In this section we first show that the dependence of constant $C$ from Theorems 8 and 15
on the depth $D$ of the game tree cannot be improved below linear dependence. The main result
of this section is then an example showing that, without the $\epsilon$-UPO property, Theorem 15
does not hold.


5.1 Dependence of the eventual NE distance on the game depth


Proposition 24 There exists $k > 0$, such that none of the Theorems 8 and 15 hold if the
constant $C$ is replaced by $\tilde{C} = kD$. This remains true even when the exploration is removed
from the strategy $\hat\sigma$.


The proposition above follows from Example 2.


Example 2 Let $G$ be the single player game from Figure 5 (the other player always has only
a single no-op action), $\eta > 0$ some small number, and $D$ the depth of the game tree. Let Exp3
with exploration parameter $\gamma = k\epsilon$ be our $\epsilon$-HC algorithm (for a suitable choice of $k$). We
recall that this algorithm will eventually identify the optimal action and play it with frequency
$1 - \gamma$, and it will choose randomly otherwise. Denote the available actions at each node as
(up, right, down), resp. (right, down) at the rightmost inner node. We define each of the
rewards $u_d$, $d = 1, \dots, D-1$ in such a way that Exp3 will always prefer to go up, rather than
right. By induction over $d$, we can see that the choice $u_1 = 1 - \gamma/2 + \eta$, $u_{d+1} = (1 - \gamma/3)\, u_d$
is sufficient and for $\gamma$ small enough, we have

$$u_{D-1} \doteq \left(1 - \frac{\gamma}{3}\right)^{D-1} \doteq 1 - \frac{D-1}{3}\gamma$$

(where by $\doteq$ we mean that for small $\gamma$, in which we are interested, the difference between the
two terms is negligible). Consequently in each of the nodes, Exp3 will converge to the strategy
$(1 - \frac{2}{3}\gamma, \frac{1}{3}\gamma, \frac{1}{3}\gamma)$ (resp. $(1 - \frac{1}{2}\gamma, \frac{1}{2}\gamma)$), which yields the payoff of approximately $(1 - \gamma/3)\, u_d$.
Clearly, the expected utility of such a strategy is approximately

$$u = (1 - \gamma/3)^{D} \doteq 1 - \frac{D}{3}\gamma.$$

On the other hand, the optimal strategy of always going right leads to utility 1, and thus
our strategy $\hat\sigma$ is a $\frac{D}{3}\gamma$-equilibrium.

Note that in this particular example, it makes no difference whether SM-MCTS or SM-MCTS-A
is used. We also observe that when the exploration is removed, the new strategy
is to go up at the first node with probability 1, which again leads to regret of approximately
$\frac{D}{3}\gamma$.
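The linear dependence on the depth in this example can be checked numerically. The sketch below rests on one reading of the game, stated here as an assumption since the figure is not reproduced: the rightmost inner node offers (right, down) with payoffs 1 and 0, every other inner node offers (up, right, down) where up pays $u_d$, right leads to the next node and down pays 0, and each node is played with the limit strategy of Exp3 described in the text. The function name is introduced here for illustration.

```python
def example2_utility(D, gamma, eta):
    """Expected utility of the limit Exp3 strategies in a depth-D chain game.

    Assumed layout: one rightmost node with actions (right, down) paying (1, 0),
    preceded by D - 1 nodes with actions (up, right, down), where "up" pays u_d,
    "right" continues to the next node and "down" pays 0.
    """
    # Rightmost inner node: strategy (1 - gamma/2, gamma/2) over (right, down).
    value = (1 - gamma / 2) * 1.0
    u = 1 - gamma / 2 + eta                       # u_1
    for _ in range(D - 1):
        # Strategy (1 - 2*gamma/3, gamma/3, gamma/3) over (up, right, down).
        value = (1 - 2 * gamma / 3) * u + (gamma / 3) * value
        u = (1 - gamma / 3) * u                   # u_{d+1} = (1 - gamma/3) * u_d
    return value


D, gamma, eta = 5, 0.03, 1e-4
loss = 1.0 - example2_utility(D, gamma, eta)      # distance from the optimal utility 1
print(loss, D * gamma / 3)
```

With these parameters the loss comes out close to $\frac{D}{3}\gamma = 0.05$, in line with the approximation used in the example. The margin by which up beats right shrinks quickly with the depth, so this floating-point sketch is only meaningful for small $D$.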


By increasing the branching factor of the game in the previous example from 3 to 6 (adding
more copies of the “0” nodes) and modifying the values of $u_d$ accordingly, we could make
the above example converge to a $\frac{2}{3}D\gamma$-equilibrium (resp. once the exploration is
removed).

In fact, we were able to construct a game of depth $D$ and $\epsilon$-HC algorithms, such that the
resulting strategy $\hat\sigma$ converged to a $3D\epsilon$-equilibrium ($2D\epsilon$ after removing the exploration).
However, the $\epsilon$-HC algorithms used in this example are non-standard and would require
the introduction of more technical notation. Therefore, since in our main theorem we use
quadratic dependence $C = kD^2$, we instead choose to highlight the following open question:


Problem 25 Does Theorem 8 (and Theorem 15) hold with $C = kD$ for some $k > 0$ (or is
the presented bound tight)?


It is our hypothesis that the answer is affirmative (and possibly the values $k = 3$, resp.
$k = 2$ after exploration removal, are optimal), but the proof of such a proposition would
require techniques different from the ones used in the proof of Theorem 8.


5.2 Counterexample for Theorem 15 


Recall that in Section 3 we proved two theorems of the following form:


Proposition 26 Let $A$ be an $\epsilon$-HC algorithm with guaranteed exploration and let $G$ be a
(zero-sum simultaneous moves perfect information) game. If $A$ is used as selection policy
for SM-MCTS(-A), then the empirical frequencies will eventually get arbitrarily close to a
$C\epsilon$-NE of the whole game, for some $C > 0$.


The goal of this subsection is to prove the following theorem:


Theorem 27 There exists a simultaneous move zero-sum game $G$ with perfect information
and a 0-HC algorithm $A$ with guaranteed exploration, such that when $A$ is used as a strategy
for SM-MCTS algorithm (rather than SM-MCTS-A), then the average strategy $\bar\sigma(t)$ almost
surely does not converge to the set of \-Nash equilibria of $G$.

This in particular implies that no theorem similar to Proposition 26 holds for SM-MCTS,
unless $A$ satisfies some additional assumptions, such as being an $\epsilon$-UPO algorithm.


We now present some observations regarding the proof of Theorem 27.


Remark 28 Firstly, it is enough to find the game $G$ and construct for each $\epsilon > 0$ an
algorithm $A_\epsilon$ which is $\epsilon$-HC, but $\bar\sigma(t)$ does not converge to the set of \-NE. From these
$\epsilon$-HC algorithms $A_\epsilon$, the desired 0-HC algorithm can be constructed in a standard way - that
is using a 1-Hannan consistent algorithm $A_1$ for some period $t_1$, then a $\frac{1}{2}$-HC algorithm $A_{1/2}$
for a longer period $t_2$ and so on. By choosing a sequence which increases quickly
enough, we can guarantee that the resulting combination of algorithms is 0-Hannan
consistent.

Furthermore, we can assume without loss of generality that the algorithm $A$ knows if
it is playing as the first or the second player and that in each node of the game, we can
actually use a different algorithm $A$. This is true, because the algorithm always accepts a
number of available actions as input. Therefore we could define the algorithm differently
based on this number, and modify our game $G$ in some trivial way (such as duplicating rows
or columns) which would not affect our example.


The structure of the proof of Theorem 27 is now as follows. First, in Example 3 we
introduce game $G$ and a sequence of joint actions leading to

$$\sum_{h \in \mathcal{H}} r^h(T) = 0 \quad \& \quad r^G(T) = \frac{1}{4}.$$

This behavior will serve as a basis for our counterexample. However, the “algorithms”
generating this sequence of actions will be oblivious to the actions of the opponent, which means
that they will not be $\epsilon$-HC. In the second step of our proof, we modify these algorithms in
such a way that the resulting sequence of joint actions stays similar to the original sequence,
but the new algorithms are $\epsilon$-HC. Theorem 27 then follows from Lemma 30 and Remark 28.


Example 3 The game: Let G be the game from Figure 6.

Behavior at J: At the node J, the players repeat (not counting the iterations when the
play does not reach J) the pattern (U,L), (U,R), (D,R), (D,L), generating payoff sequence

$$s_Y(1), s_Y(2), \ldots = 1, 0, 1, 0, \ldots$$

Looking at time steps of the form $t = 4k$, $k \in \mathbb{N}$, the average strategy at J will then be
$\bar{\sigma}^J_1 = \bar{\sigma}^J_2 = (\frac{1}{2}, \frac{1}{2})$ and the corresponding payoff of the maximizing player 1 is $\frac{1}{2}$. Note that
neither of the players could improve his utility at J by changing all his actions to any single
action, therefore for both players, we have $r^J(T) = 0$.

Behavior at I: Let $T = 4k$. At the node I, player 1 repeatedly plays Y, X, X, Y, .... For
iteration t and action a, we denote by $x_a(t)$ the reward we would receive if we played a at
node I at time t, provided we repeated the Y, X, X, Y pattern up to iteration t − 1 and used
the above defined behavior at J (formally we have $x_X(t) = 0$, $x_Y(t) = s_Y((t-1)_Y + 1)$,
where, as always, $(t-1)_Y$ denotes the number of uses of action Y up to time t − 1). Denote
by a(t) the action played at time t. The payoffs $x_{a(t)}(t)$ we actually do receive will then form
a 4-periodic sequence

$$x_Y(1) = 1, \quad x_X(2) = 0, \quad x_X(3) = 0, \quad x_Y(4) = 0.$$

Clearly if we change the strategy from the current $\bar{\sigma}^I = (\frac{1}{2}, \frac{1}{2})$ to (0,1), we would receive an
average payoff $\frac{1}{2}$. This means that the average overall regret of the whole game G for player
1 is equal to $r^G(T) = \frac{1}{4}$. However, if we look only at the situation at node I and represent
it as a bandit problem, we see that the payoff sequence $x_Y(1), x_Y(2), \ldots$ for action Y will
be 1, 0, 0, 0, 1, 0, 0, 0, ... (while $x_X(t) = 0$ for each t). At first, this might seem strange, but
note that the reward for action Y does not change when X is chosen. This implies that,
from the MAB point of view, the player believes he cannot receive an average payoff higher
than $\frac{1}{4}$ and thus he observes no regret and $r^I(T) = 0$.
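
The bookkeeping of Example 3 can be replayed mechanically. The following Python sketch is only an illustration under our reading of the example: the payoff matrix at J is inferred from the payoff sequence 1, 0, 1, 0 generated by the pattern, and the function names `simulate` and `regrets` are hypothetical helpers, not part of any algorithm analyzed in the paper.

```python
from fractions import Fraction

# Payoffs of player 1 at node J, inferred from the pattern's payoff sequence 1, 0, 1, 0.
PAYOFF_J = {("U", "L"): 1, ("U", "R"): 0, ("D", "R"): 1, ("D", "L"): 0}
PATTERN_J = [("U", "L"), ("U", "R"), ("D", "R"), ("D", "L")]
PATTERN_I = ["Y", "X", "X", "Y"]


def simulate(T):
    """Replay both patterns for T iterations (T should be a multiple of 4)."""
    visits_j = 0   # number of visits of node J so far
    received = []  # payoff player 1 actually receives in each iteration
    x_y = []       # x_Y(t): payoff that action Y would yield at I in iteration t
    played_j = []  # joint actions actually played at J
    for t in range(T):
        joint = PATTERN_J[visits_j % 4]  # joint action waiting at J
        x_y.append(PAYOFF_J[joint])
        if PATTERN_I[t % 4] == "Y":
            played_j.append(joint)
            received.append(PAYOFF_J[joint])
            visits_j += 1
        else:  # action X ends the iteration with payoff 0
            received.append(0)
    return received, x_y, played_j


def regrets(T):
    """Return (regret at I, regret of player 1 at J, regret of player 2 at J, overall regret)."""
    received, x_y, played_j = simulate(T)
    total = sum(received)
    # Bandit view at I: x_X(t) = 0 for all t, so the best fixed action earns max(sum x_Y, 0).
    r_i = Fraction(max(sum(x_y), 0) - total, T)
    n = len(played_j)
    got_j = sum(PAYOFF_J[a] for a in played_j)
    best_1 = max(sum(PAYOFF_J[(u, c)] for _, c in played_j) for u in "UD")
    best_2 = min(sum(PAYOFF_J[(r, c)] for r, _ in played_j) for c in "LR")
    r_j1 = Fraction(best_1 - got_j, n)
    r_j2 = Fraction(got_j - best_2, n)  # player 2 is the minimizer
    # Whole game: always playing Y reaches J (and advances its pattern) in every iteration.
    always_y = sum(PAYOFF_J[PATTERN_J[t % 4]] for t in range(T))
    r_g = Fraction(always_y - total, T)
    return r_i, r_j1, r_j2, r_g
```

For T = 8 the sketch reports zero regret at both nodes and overall regret 1/4, in agreement with the values derived above.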

Remark 29 Recall here the definition of ε-UPO property of an algorithm, which requires
the “observed average payoffs” $\bar{s}_a(t)$ for all actions a to be close to the real average payoffs
$\tilde{s}_a(t)$. In this case, we have $\bar{s}_Y(t) = \frac{1}{4}$ and $\tilde{s}_Y(t) = \frac{1}{2}$, which means that the above algorithm
is far from being ε-UPO.


Figure 6: Example of a game in which it is possible to minimize regret at each of the nodes
while having high overall regret.


Lemma 30 Let G be the game from Figure 6. Then for each ε > 0 there exist ε-HC
algorithms $A^I$, $A^J_1$, $A^J_2$, such that when these algorithms are used for SM-MCTS in G, the
resulting average strategy $\bar{\sigma}(t)$ converges to $\bar{\sigma}$, where $\bar{\sigma}^I = \bar{\sigma}^J_1 = \bar{\sigma}^J_2 = (\frac{1}{2}, \frac{1}{2})$.


As noted in Example 3, the strategy $\bar{\sigma}$ satisfies $u_1(\bar{\sigma}) = \frac{1}{4}$, while the equilibrium strategy
$\pi$, where $\pi^I = (0,1)$, $\pi^J_1 = \pi^J_2 = (\frac{1}{2}, \frac{1}{2})$, gives utility $u_1(\pi) = \frac{1}{2}$. Therefore the existence of
algorithms from Lemma 30 proves the theorem.

The key idea behind Lemma 30 is the following: both players repeat the pattern from
Example 3, but we let them perform random checks which detect any adversary who deviates
enough to change the average payoff. If the players repeat the pattern, by the previous
example they observe no regret at any of the nodes. On the other hand, if one of them
deviates significantly, he will be detected by the other player, who then switches to a “safe”
ε-HC algorithm, leading again to a low regret. The definition of the modified algorithms
used in Lemma 30, along with the proof of their properties, can be found in the appendix.


We recall that there exists an algorithm, called CFR (Zinkevich et al., 2007), which
provably converges in our setting. The following remark explains why the proof of its 
convergence cannot be simply modified to work for SM-MCTS(-A), but a new proof had to 
be found instead. 


Remark 31 (CFR and bounding game regret by sum of node regrets) The convergence
of CFR algorithm relies on two facts: firstly, in each node $h \in \mathcal{H}$, the algorithm minimizes
so called average immediate counterfactual regret, which we denote here by $R^{h}_{\mathrm{imm}}(T)/T$.
Secondly, the overall average regret in the whole game, which we denote by $r^G(T)$, can be
bounded by the sum of “local” regrets in the game nodes:

$$r^G(T) \le \sum_{h \in \mathcal{H}} R^{h,+}_{\mathrm{imm}}(T)/T \qquad (10)$$

(Theorem 3 by Zinkevich et al., 2007). It is then well known that when both players have
low overall regret $r^G(T)$, the average strategy is close to an equilibrium.

We now look at the similarities between this situation for CFR and for SM-MCTS. ε-HC
algorithms, used by SM-MCTS, guarantee that the average regret $r^h(t)$ is, in the limit, at
most ε at every $h \in \mathcal{H}$. In other words, SM-MCTS also minimizes some kind of regret in
each of the nodes $h \in \mathcal{H}$, like CFR does. It is then logical to ask whether it is also possible
to bound $r^G(T)$ by the sum of “local” regrets $\sum_{h \in \mathcal{H}} r^h(t^h(T))$, as is the case for the
counterfactual regret in CFR (where by $t^h(T)$ we denote the “local time” at node h, or more
precisely the number of visits of node h during SM-MCTS iterations 1,...,T). The following
corollary, which is an immediate consequence of Theorem 21, gives a negative answer to this question.


Corollary 32 There exists a game G and α > 0, such that for every β > 0, there exists a
sequence of joint actions resulting in

$$\sum_{h \in \mathcal{H}} r^h(t^h(T)) \le \beta \quad \& \quad \alpha \le r^G(T).$$

In particular, the inequality $r^G(T) \le \sum_{h \in \mathcal{H}} r^h(t^h(T))$ does not hold and this approach which
worked for CFR cannot be applied to SM-MCTS. Intuitively, this is caused by the differences
between the two distinct notions of regret used by SM-MCTS and CFR.


6. Experimental evaluation 

In this section, we present the experimental data related to our theoretical results. First, 
we empirically evaluate our hypothesis that Exp3 and regret matching algorithms ensure 
the ε-UPO property. Second, we test the empirical convergence rates of SM-MCTS and
SM-MCTS-A on synthetic games as well as smaller variants of games played by people. 
We investigate the practical dependence of the convergence error based on the important 
parameters of the games and evaluate the effect of removing the samples due to exploration 
from the computed strategies. 

Figure 7: The Anti game used for evaluation of the algorithms.


6.1 Experimental Domains 


Goofspiel Goofspiel is a card game that appears in many works dedicated to simultaneous-move
games (for example Ross (1971); Rhoads and Bartholdi (2012); Saffidine et al. (2012);
Lanctot et al. (2014); Bosansky et al. (2013)). There are 3 identical decks of cards with
values {0,..., (d — 1)} (one for nature and one for each player). Value of d is a parameter of 
the game. The deck for the nature is shuffled at the beginning of the game. In each round, 
nature reveals the top card from its deck. Each player selects any of their remaining cards 
and places it face down on the table so that the opponent does not see the card. Afterwards, 
the cards are turned face up and the player with the higher card wins the card revealed by 
nature. The card is discarded in case of a draw. At the end, the player with the higher 
sum of the nature cards wins the game or the game is a draw. People play the game with 
13 cards, but we use smaller numbers in order to be able to compute the distance from the 
equilibrium (that is, exploitability) in a reasonable time. We further simplify the game by a 
common assumption that both players know the sequence of the nature’s cards in advance. 
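
The scoring rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not the implementation used in our experiments: the function name `play_goofspiel` is hypothetical, and for simplicity both players' card orders are passed as fixed sequences, whereas in the game they are chosen round by round.

```python
def play_goofspiel(nature_order, plays1, plays2):
    """Score one game; each argument is a permutation of the card values 0..d-1."""
    d = len(nature_order)
    for cards in (nature_order, plays1, plays2):
        if sorted(cards) != list(range(d)):
            raise ValueError("each deck must contain every value 0..d-1 exactly once")
    score1 = score2 = 0
    for prize, c1, c2 in zip(nature_order, plays1, plays2):
        if c1 > c2:
            score1 += prize  # player 1 wins the nature card
        elif c2 > c1:
            score2 += prize  # player 2 wins the nature card
        # on a draw the nature card is discarded
    outcome = (score1 > score2) - (score1 < score2)  # 1, -1, or 0 for a draw
    return score1, score2, outcome
```

With d = 3, nature order 2, 0, 1 and plays 2, 1, 0 against 0, 1, 2, player 1 wins the card 2, the second round is a draw, and player 2 wins the card 1.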


Oshi-Zumo Each player in Oshi-Zumo (for example, Buro (2004)) starts with N coins,
and a one-dimensional playing board with 2K + 1 locations (indexed 0,..., 2K) stretches
between the players. At the beginning, there is a stone (or a wrestler) located in the center 
of the board (that is, at position K). During each move, both players simultaneously place 
their bid from the amount of coins they have (but at least one if they still have some coins). 
Afterwards, the bids are revealed, the coins used for bids are removed from the game, and 
the highest bidder pushes the wrestler one location towards the opponent’s side. If the bids 
are the same, the wrestler does not move. The game proceeds until the money runs out for 
both players, or the wrestler is pushed out of the board. The player closer to the wrestler’s 
final position loses the game. If the final position of the wrestler is the center, the game is 
a draw. In our experiments, we use a version with K = 2 and N = 5. 
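
A single move of Oshi-Zumo can be sketched as follows. The sketch makes assumptions that the description leaves open: player 1 is taken to sit at the side of location 0, so a winning bid of player 1 moves the wrestler towards higher indices, and the names `step`, `is_terminal` and `outcome` are hypothetical.

```python
def step(state, bid1, bid2):
    """Apply one simultaneous move; state is (coins1, coins2, position)."""
    coins1, coins2, pos = state
    for coins, bid in ((coins1, bid1), (coins2, bid2)):
        min_bid = 1 if coins > 0 else 0  # at least one coin while any coins remain
        if not min_bid <= bid <= coins:
            raise ValueError("illegal bid")
    if bid1 > bid2:
        pos += 1  # assumption: player 1 pushes towards location 2K
    elif bid2 > bid1:
        pos -= 1
    return coins1 - bid1, coins2 - bid2, pos


def is_terminal(state, K):
    coins1, coins2, pos = state
    return (coins1 == 0 and coins2 == 0) or pos < 0 or pos > 2 * K


def outcome(state, K):
    """1 if player 1 wins, -1 if player 2 wins, 0 for a draw."""
    pos = state[2]
    return (pos > K) - (pos < K)
```

With K = 2 and N = 5, the bids (3, 1), (1, 4), (1, 0) lead from (5, 5, 2) to the terminal state (0, 0, 3), which player 1 wins under the orientation assumed here.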


Random Game In order to achieve more general results, we also use randomly generated 
games. The games are defined by the number of actions B available to each player in each 
decision point and a depth D (D = 0 for leaves), which is the same for all branches. The
utility values in the leaves are selected randomly from a uniform distribution over (0,1).
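
A generator of such games can be sketched as below. The nested-list representation and the names `random_game` and `leaves` are illustrative assumptions; note also that `random.Random.random` samples from [0, 1) rather than the open interval.

```python
import random


def random_game(B, D, rng):
    """Return a leaf utility (D == 0) or a B x B matrix of subgames of depth D - 1."""
    if D == 0:
        return rng.random()  # uniform on [0, 1)
    return [[random_game(B, D - 1, rng) for _ in range(B)] for _ in range(B)]


def leaves(node):
    """Iterate over all leaf utilities of a game built by random_game."""
    if isinstance(node, float):
        yield node
    else:
        for row in node:
            for child in row:
                yield from leaves(child)
```

A game with B = D = 3 has (B·B)^D = 729 leaves, since each stage has B·B joint actions.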


Anti The last game we use in our evaluation is based on the well-known single player
game, which demonstrates the super-exponential convergence time of the UCT algorithm
(Coquelin and Munos, 2007). The game is depicted in Figure 7. In each stage, it deceives
the MCTS algorithm to end the game while it is optimal to continue until the end.


Figure 8: The maximum of the bias in payoff observations in MCTS without averaging the
sample values.


6.2 e-UPO property 

In order to be able to apply Theorem (that is, convergence of SM-MCTS without averaging)
to Exp3 and regret matching, the selection algorithms have to assure the ε-UPO
property for some ε. So far, we were unable to prove this hypothesis. Instead, we support
this claim by the following numerical experiments. Recall that having ε-UPO property is
defined as the claim that for every game node $h \in \mathcal{H}$ and every joint action (i,j) available at
h, the difference $|\bar{s}_{ij}(n) - \tilde{s}_{ij}(n)|$ between the weighted and arithmetical averages decreases
below ε, as the number n of uses of (i,j) at h increases to infinity.
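
The measured quantity can be sketched as follows. The weights are taken as an input here; their exact definition belongs to the definition of the UPO property and is not reproduced in this sketch, and the function names are hypothetical.

```python
from fractions import Fraction


def arithmetic_average(samples):
    """Plain average of the first n sampled payoffs of a joint action."""
    return Fraction(sum(samples), len(samples))


def weighted_average(samples, weights):
    """Average of the same samples, the m-th one counted with weight weights[m]."""
    return Fraction(sum(s * w for s, w in zip(samples, weights)), sum(weights))


def upo_bias(samples, weights):
    """Absolute difference between the weighted and the arithmetical average."""
    return abs(weighted_average(samples, weights) - arithmetic_average(samples))
```

For instance, samples 1, 0, 1, 0 with weights 1, 3, 1, 3 (the situation of action Y in Example 3, under our reading of the weights) give a weighted average of 1/4 against an arithmetical average of 1/2.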

We measured the value of this sum in the root node of the four domains described above.
Except for the random games, the depth of the game was set to 5. For the random games,
the depth and the branching factor were B = D = 3. Figure 8 presents one graph for each
domain and each algorithm. The x-axis is the number of iterations and the y-axis depicts 
the maximum value of the sum from the iteration on the x-axis to the end of the run of 
the algorithm. The presented value is the maximum from 50 runs of the algorithms. For 
all games, the difference eventually converges to zero. Generally, larger exploration ensures 
that the difference goes to zero more quickly and the bias in payoff observation is smaller. 

The main reason for the bias is easy to explain in the Anti game. Figure 9a presents the
maximal values of the bias in small time windows during the convergence from all 50 runs. It
is apparent that the bias during the convergence tends to jump very high (higher for smaller
exploration) and then gradually decrease. This, however, happens only until a certain point
in time. The reason for this behavior is that if the algorithm learns an action is good in a
specific state, it will use it very often and do the updates for the action with much smaller
weight in $\bar{s}_{ij}(n)$ than the updates for the other action. However, when the other action later
proves to be substantially better, the value of that action starts increasing rather quickly.
At the same time, its probability of being played starts increasing and as a result, the
weights used for the received rewards start decreasing. This will cause a strong dependence
between the rewards and the weights, which causes the bias. With smaller exploration, it
takes more time to identify the better alternative action; hence, when it happens, the wrong
action has already accumulated larger reward and the discrepancy between the right values
and the probability of playing the actions is even stronger.


Figure 9: The dependence of the current value of $|\bar{s}_{ij}(n) - \tilde{s}_{ij}(n)|$ on the number of
iterations that used the given joint action in (a) Anti game (various exploration factors) and
(b) Goofspiel with 4 cards per deck (various joint actions).


We also tested satisfaction of the UPO property in the root node of depth 4 Goofspiel,
using Exp3 algorithm and exploration ε = 0.001. The results in Figure 9b indicate that
Exp3 with exploration 0.001 possesses the 0.001-UPO property, however this time, much
higher $n_0$ is required (around 5 • 10®).


We can divide the joint actions (i, j) at the root into three groups: (1) the actions which
both players play (nearly) only when exploring, (2) the actions which one of the players
chooses only because of the exploration, and (3) the actions which none of the players uses
only because of the exploration. In Figure 9b, (1) is on the left, (2) in the middle and (3) on
the right. The third type of actions easily satisfied $|\bar{s}_{ij}(n) - \tilde{s}_{ij}(n)| \le \epsilon$, while for the second
type, this inequality seems to eventually hold as well. The shape of the graphs suggests that
for the first type, the difference between $\bar{s}_{ij}(n)$ and $\tilde{s}_{ij}(n)$ will eventually get below ε as well;
however, the 10® iterations we used were not sufficient for this to happen. Luckily, even if the
inequality did not hold for these cases, it does not prevent the convergence of SM-MCTS algorithm
to an approximate equilibrium. In the proof of Proposition 19, the term $|\bar{s}_{ij}(n) - \tilde{s}_{ij}(n)|$
is weighted by the empirical frequencies $T_j/T$ (or even $T_{ij}/T$). For an action which is
only played because of exploration, this number converges to ε/(number of actions) (resp.
(ε/(number of actions))$^2$), so even if we had $|\bar{s}_{ij}(n) - \tilde{s}_{ij}(n)| = 1$, we could still bound
the required term by ε, which is needed in the proof of the respective theorem.

6.3 Empirical convergence rate of SM-MCTS(-A) algorithms 

In this section, we investigate the empirical convergence rates of the analyzed algorithms. 
We first compare the speeds of convergence of SM-MCTS-A and SM-MCTS and then investigate
the dependence of the error of the eventual solution of the algorithms on relevant
parameters. Finally, we focus on the effect of removing the exploration samples discussed 
in Section m 


6.3.1 SM-MCTS with and without averaging


Figure 10 presents the dependence of the exploitability of the strategies produced by the
algorithms on the number of executed iterations. We removed the samples caused by 
exploration from the strategy, as suggested in Section All iterations are executed from 
the root of the game. The colors (and line types) in the graphs represent different settings 
of the exploration parameter. SM-MCTS-A (circles) seems to always converge to the same 
distance from the equilibrium as SM-MCTS (triangles), regardless of the used selection 
function. The convergence of the variant with averaging is generally slower. The difference 
is most visible in the Anti game with Exp3 selection function, where the averaging can 
cause the convergence to require even 10 times more iterations to reach the same distance 
from NE as the algorithm without averaging. The situation is similar also in Oshi-Zumo. 
However, the effect is much weaker with RM selection. With the exception of the Anti 
game, the variants with and without averaging converge at almost the same speed. 


In Section we show that the finite time convergence rate that we were able to prove 
is not very good. These experiments show that in our practical problems of smaller size, 
suitable selection of the exploration parameter allows the algorithms to converge to their
eventual solution within 10® iterations. This indicates that the bound can be substantially 
improved. 


6.3.2 Distance from the equilibrium

Even though the depth of most games in Figure 10 was 5, even with large exploration (0.4),
the algorithm was often able to find the exact equilibrium. This indicates that in practical
problems, even the linear bound on the distance from the equilibrium from the example in 
Section is too pessimistic. 

If the game contains pure Nash equilibria, as in Anti and the used setting of Oshi-Zumo,
exact equilibrium can often be found. If (non-uniform) mixed equilibria are required, the
distance of the eventual solution from the equilibrium increases both with the depth of the
game as well as the amount of exploration. The effect of the amount of exploration is visible
in Figure 10, where the largest exploration prevented the algorithm from converging to the
exact equilibrium. A more gradual effect is visible in Figures 10(g,h), where the distance
from the equilibrium seems to increase linearly with increasing exploration. Note that in
all cases, the exploitability (computed as the sum of exploitability of both players) was less
than 2 · εM, where ε is the amount of exploration and M is the maximum utility value.
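
To make the measured quantity concrete, the following sketch computes the exploitability of a strategy pair in a single zero-sum matrix game as the sum of the gains both players could obtain by best-responding. It is a simplified illustration only: in the extensive-form games used in our experiments the best responses are computed over the whole game tree, and the function name `exploitability` is hypothetical.

```python
from fractions import Fraction


def exploitability(matrix, sigma1, sigma2):
    """Sum of best-response gains of both players (player 1 maximizes)."""
    rows, cols = range(len(matrix)), range(len(matrix[0]))
    value = sum(sigma1[i] * matrix[i][j] * sigma2[j] for i in rows for j in cols)
    # Best pure response of player 1 against sigma2.
    best_for_1 = max(sum(matrix[i][j] * sigma2[j] for j in cols) for i in rows)
    # Best pure response of the minimizing player 2 against sigma1.
    best_for_2 = min(sum(sigma1[i] * matrix[i][j] for i in rows) for j in cols)
    return (best_for_1 - value) + (value - best_for_2)
```

In the matrix game with payoffs 1, 0 / 0, 1, the uniform strategy pair has exploitability 0, while a player 1 who always plays the first row against a uniform opponent yields exploitability 1/2.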

Figure 10: Comparison of empirical convergence rates of SM-MCTS (triangles) and SM-MCTS-A
(circles) with Exp3 and RM selection functions in various domains. Panels: (a) Anti(5), Exp3;
(b) Anti(5), RM; (c) Goofspiel(5), Exp3; (d) Goofspiel(5), RM; (e) Oshi-Zumo(5), Exp3;
(f) Oshi-Zumo(5), RM; (g) Random(3,3), Exp3; (h) Random(3,3), RM. Exploration settings:
0.05, 0.1, 0.2, 0.4.

Figure 11: Convergence of SM-MCTS with Exp3 selection on random games with three
actions of each player in each stage and various depths.

Figure 12: The effect of removing exploration samples in SM-MCTS with Exp3 selection
and ε = 0.2. Panels: (a) Anti(5); (b) Goofspiel(5); (c) Oshi-Zumo(5); (d) Random(3).
Legend: exploration samples kept or removed.


Figure 11 presents the average distance from the equilibrium with SM-MCTS, Exp3 
selection and exploration ε = 0.2 in random games with B = 3 and various depths. The
eventual error increases with increasing depth, but even with depth of 6, the eventual error 
was on average around 0.25 and always less than 0.3. 


6.3.3 Removing exploration 

In Section we show that the computed strategy cannot, in general, get worse when we 
disregard the samples caused by exploration. Figure 12 shows that in practice, the strategy
is usually improved from the very beginning of the convergence and the exploration should 
always be removed. 


7. Conclusion 

Monte Carlo Tree Search has recently become a popular algorithm for creating artificial
game-playing agents. Besides perfect information games, where the behavior of the algorithm
is reasonably well understood, variants of the algorithm have been successful also in
more complex imperfect-information games. However, there was very little pre-existing
theory that would describe the behavior of the algorithms in these games and provide
guarantees on their performance.

                SM-MCTS-A     SM-MCTS                 SM-MCTS
Assumptions     ε-HC          ε-HC, ε-UPO             ε-HC only
Upper bound     2D(D + 1)ε    (12(2^D − 1) − 8D)ε     might not converge
Lower bound     2Dε           2Dε                     to approx. NE at all

Table 2: Summary of the proven bounds on the worst case eventual exploitability of the
strategy produced by SM-MCTS(-A) run with an ε-Hannan consistent selection function.
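
To give a feeling for how quickly the proven bounds from Table 2 grow with the depth D, the following sketch evaluates their coefficients of ε. The function names are hypothetical; the formulas are those listed in the table.

```python
def upper_bound_sm_mcts_a(D):
    """Coefficient of epsilon in the upper bound for SM-MCTS-A (eps-HC selection)."""
    return 2 * D * (D + 1)


def upper_bound_sm_mcts(D):
    """Coefficient of epsilon in the upper bound for SM-MCTS (eps-HC and eps-UPO)."""
    return 12 * (2 ** D - 1) - 8 * D


def lower_bound(D):
    """Coefficient of epsilon in the lower bound for both variants."""
    return 2 * D


if __name__ == "__main__":
    for depth in range(1, 7):
        print(depth, upper_bound_sm_mcts_a(depth), upper_bound_sm_mcts(depth), lower_bound(depth))
```

For D = 5, for example, the coefficients are 60 for SM-MCTS-A and 332 for SM-MCTS, against the lower bound of 10.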

In this paper, we provide the theory and guaranteed results for the simplest, but still 
important, subclass of imperfect information games - sequential zero-sum games with si¬ 
multaneous moves, but otherwise perfect information. These games already include one of 
the major complications caused by imperfect information, which is the need to use random¬ 
ized strategies to guarantee the optimal performance. We also note that while we focus on 
games with simultaneous moves, all presented theoretic results (apart from the SM-MCTS 
counterexample from Section 5.2) trivially apply also to perfect information games with 
sequential moves. 

Our main results from Section show that a variant of Monte Carlo Tree Search algo¬ 
rithm, which we call SM-MCTS-A, in combination with any Hannan consistent algorithm is 
guaranteed to eventually converge to Nash equilibrium of the game. Moreover, if the used 
selection function, in addition to being HC, has the Unbiased Payoff Observations property, 
even the standard SM-MCTS algorithm is guaranteed to converge to an approximate Nash 
equilibrium. On the other hand, in Section we present a counterexample showing that 
there exist HC algorithms, which converge with SM-MCTS-A, but not with SM-MCTS. 

More detailed results are summarized in Table 2. In Theorem we show that
SM-MCTS-A (SM-MCTS) algorithm with ε-HC selection function eventually converges at
least to Cε-NE of a game, where for game depth D, C is of the order $D^2$ ($2^D$). In Section
we show that the worst case dependence of C on D cannot be sublinear, even after the
exploration is removed. This gives us both lower and upper bounds on the value of C, but
it remains to determine whether these bounds are tight. We form a hypothesis that the
tight bound is a linear dependence C = 3D (C = 2D after the exploration is removed).

In Section we provide an analysis of previously suggested improvement of SM-MCTS 
algorithm, which proposes the removal of samples caused by exploration from the played 
strategy. We prove both formally and empirically that this modification of the algorithm is 
sound and generally improves the strategy. 

In Theorem 14 we show that, for a fixed confidence level, SM-MCTS-A algorithm con- 

2 

verges to the given Ce-equilibrium at rate at least 1/T~d^ . This estimate is most likely 
overly pessimistic, as suggested by the empirical results. 

Finally, we provide empirical investigation of the algorithms that shows that in practical
problems, the convergence times as well as the eventual distance from the Nash equilibrium
are better than the theoretic guarantees. We show that SM-MCTS-A converges slower than
SM-MCTS with the commonly used HC selection functions, but they converge to the same
distance from the equilibrium. Moreover, the difference in convergence speed is smaller with
regret matching than with Exp3 and in many domains, it is negligible.

While this paper provides a significant step towards understanding of MCTS in simulta¬ 
neous move games, it also leaves some problems open. First of all, many of the guarantees 
presented in the paper have not been shown to be tight so an obvious future research direc¬ 
tion would be to improve the guarantees or show the tightness of these results. Also, better 
characterization of the requirements (on the selection function) which guarantee conver¬ 
gence with SM-MCTS algorithm could be provided. For example, it would be interesting to 
formally prove that the common Hannan consistent algorithms guarantee unbiased payoff 
observations or a similar property that is sufficient to guarantee convergence of SM-MCTS 
in this setting. 

Furthermore, MCTS algorithms are generally used with incremental building of the 
search tree and a problem-specific heuristic simulation strategy outside of the portion of 
the search tree in memory. It would be interesting to analyze the behavior of the algorithms 
with respect to basic statistical properties of the simulation strategy, such as its bias and 
variance. Lastly, our analysis of simultaneous move games can be used as a basis for 
analyzing MCTS as it is used in more general classes of imperfect information games. 

Appendix A. Proofs

In this section we present the proofs for those results, which have not been already proven 
in the main text. We start with the proof of Lemma which states that eventually, there 
is no difference between empirical and average strategies. 

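Proof [Proof of Lemma] It is enough to show that $\limsup_{t\to\infty} |\hat{\sigma}_1(t,i) - \bar{\sigma}_1(t,i)| = 0$ holds
almost surely for any given i. Using the definitions of $\hat{\sigma}_1(t,i)$ and $\bar{\sigma}_1(t,i)$, we get

$$\hat{\sigma}_1(t,i) - \bar{\sigma}_1(t,i) = \frac{1}{t}\Big(t_i - \sum_{s=1}^{t}\sigma_1(s,i)\Big) = \frac{1}{t}\sum_{s=1}^{t}\big(\delta_{i\,i(s)} - \sigma_1(s,i)\big),$$

where $\delta_{ij}$ is the Kronecker delta. Using the (martingale version of) Central Limit Theorem
on the sequence of random variables $X_t = \sum_{s=1}^{t}\big(\delta_{i\,i(s)} - \sigma_1(s,i)\big)$ gives the result
(the conditions clearly hold, since $E[\delta_{i\,i(t)} - \sigma_1(t,i) \mid X_1, \ldots, X_{t-1}] = 0$ implies that $X_t$ is a
martingale and $\delta_{i\,i(t)} - \sigma_1(t,i) \in [-1,1]$ guarantees that all required moments are finite). ■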


Next, we prove the lemma which states that ε-Hannan consistency is not substantially
affected by additional exploration.

Proof [Proof of Lemma] Denoting by * the variables corresponding to the algorithm A*
we get

$$r^*(t) = \frac{1}{t} R^*(t) \le \frac{1}{t}\big(1 \cdot t_{ex} + R(t - t_{ex})\big) = \frac{t_{ex}}{t} + \frac{R(t - t_{ex})}{t - t_{ex}} \cdot \frac{t - t_{ex}}{t},$$

where, for given $t \in \mathbb{N}$, $t_{ex}$ denotes the number of times A* explored up to t-th iteration. By
Strong Law of Large Numbers we have that $\lim_{t\to\infty} t_{ex}/t = \gamma$ holds almost surely. This implies

$$\limsup_{t\to\infty} r^*(t) \le \limsup_{t\to\infty}\frac{t_{ex}}{t} + \limsup_{t - t_{ex}\to\infty}\frac{R(t - t_{ex})}{t - t_{ex}} \cdot \limsup_{t\to\infty}\frac{t - t_{ex}}{t} \le \gamma + \epsilon(1 - \gamma) \le \gamma + \epsilon,$$

which means that A* is (ε + γ)-Hannan consistent. The guaranteed exploration property
of A* is trivial. ■


A.1 Proofs related to the convergence of SM-MCTS-A

In this section we give the proofs for lemmas which were used to prove Theorem We 
begin with Lemma which established a connection between average payoff g and game 
value V for matrix games. 

Proof [Proof of Lemma] It is our goal to draw conclusion about the quality of the empirical
strategy based on information about the performance of our ε-HC algorithm. Ideally, we
would like to somehow relate the utility $u(\hat{\sigma})$ to the average payoff g(t). However, as this
is generally impossible, we can do the next best thing:

$$u(br, \hat{\sigma}_2(t)) = \max_i \sum_j \hat{\sigma}_2(t)(j)\, a_{ij} = \frac{1}{t}\max_i \sum_j t_j\, a_{ij} = \frac{1}{t}\max_i \sum_{s=1}^{t} a_{ij(s)} = \frac{1}{t} G_{\max}(t) = g_{\max}(t). \qquad (11)$$

Step 1: Let η > 0. Using ε-HC property gives us the existence of such $t_0$ that $g_{\max}(t) -
g(t) < \epsilon + \frac{\eta}{2}$ holds for all $t \ge t_0$, which is equivalent to $g(t) > g_{\max}(t) - (\epsilon + \frac{\eta}{2})$. However, in our
zero-sum matrix game setting, $g_{\max}$ is always at least v, which implies that $g(t) > v - (\epsilon + \frac{\eta}{2})$.
Using the same argument for player 2 gives us that $g(t) < v + \epsilon + \frac{\eta}{2}$. Therefore we have the
following statement, which proves the inequalities:

$$\forall t \ge t_0 : \; v - (\epsilon + \tfrac{\eta}{2}) < g(t) < v + \epsilon + \tfrac{\eta}{2} \quad \text{holds almost surely.}$$

Step 2: We assume, for contradiction, that with non-zero probability, there exists an
increasing sequence of time steps $t_n \to \infty$, such that $u(br, \hat{\sigma}_2(t_n)) \ge
v + 2\epsilon + \eta$ for some η > 0. Combining this with the inequalities, which we proved above, we
see that

$$\begin{aligned}
\limsup_{t\to\infty} r(t) &\ge \limsup_{n\to\infty} r(t_n) = \limsup_{n\to\infty}\big(g_{\max}(t_n) - g(t_n)\big) \\
&= \limsup_{n\to\infty}\big(u(br, \hat{\sigma}_2(t_n)) - g(t_n)\big) \\
&\ge v + 2\epsilon + \eta - (v + \epsilon + \eta/2) = \epsilon + \eta/2 > \epsilon
\end{aligned}$$

holds with non-zero probability, which is in contradiction with ε-Hannan consistency. ■


Remark 33 In the following proof, and in the proof of Proposition 19, we will be working
with regrets, average payoffs and other quantities related to matrix games with error, in
which we have two sets of rewards - the rewards $a_{ij}$ corresponding to the matrix M and
the “observed” rewards $a_{ij}(t)$. We will denote the variables related to the distorted rewards
$a_{ij}(t)$ by normal symbols (for example $g_{\max}(t) = \max_i \frac{1}{t}\sum_{s=1}^{t} a_{ij(s)}(s)$) and use symbols with
tilde for the variables related to the rewards $a_{ij}$ (for example $\tilde{g}_{\max}(t) = \max_i \frac{1}{t}\sum_{s=1}^{t} a_{ij(s)}$).

Proof [Proof of Proposition] This proposition strengthens the result of the preceding lemma and its
proof will also be similar. The only additional technical ingredient is the following inequality
(12).

Since M(t) is a repeated game with error cε, there almost surely exists $t_0$, such that for
all $t \ge t_0$, $|a_{ij}(t) - a_{ij}| \le c\epsilon$ holds. This leads to

$$|\tilde{g}_{\max}(t) - g_{\max}(t)| \le \max_i \frac{1}{t}\sum_{s=1}^{t} |a_{ij(s)}(s) - a_{ij(s)}| \le \frac{t_0}{t} + c\epsilon \cdot \frac{t - t_0}{t} \xrightarrow{t\to\infty} c\epsilon. \qquad (12)$$

The remainder of the proof contains no new ideas, and it is nearly exactly the same as the
proof of the preceding lemma, therefore we just note what the two main steps are:

Step 1: Hannan consistency gives us that

$$\forall \eta > 0 \;\exists t_0 \in \mathbb{N} \;\forall t \ge t_0 : \; g(t) \ge v - (\epsilon + \eta) - |\tilde{g}_{\max}(t) - g_{\max}(t)| \quad \text{holds a.s.,}$$

from which we deduce the inequalities

$$v - (\epsilon + c\epsilon) \le \liminf_{t\to\infty} g(t) \le \limsup_{t\to\infty} g(t) \le v + \epsilon + c\epsilon.$$

Step 2: For contradiction we assume that there exists an increasing sequence of time steps
$t_n \to \infty$, such that

$$u(br, \hat{\sigma}_2(t_n)) \ge v + 2(c+1)\epsilon + \eta \qquad (13)$$

holds for some η > 0. Using the identity $\tilde{g}_{\max}(t_n) = u(br, \hat{\sigma}_2(t_n))$ and inequalities (12) and
(13), we then compute that the regret $r(t_n)$ is too high, which completes the proof:

$$\begin{aligned}
\epsilon \ge \limsup_{t\to\infty} r(t) &\ge \limsup_{n\to\infty} r(t_n) = \limsup_{n\to\infty}\big(g_{\max}(t_n) - g(t_n)\big) \\
&\ge \limsup_{n\to\infty}\big(u(br, \hat{\sigma}_2(t_n)) - |\tilde{g}_{\max}(t_n) - g_{\max}(t_n)|\big) - \limsup_{n\to\infty} g(t_n) \\
&\ge \big(v + 2(c+1)\epsilon + \eta\big) - c\epsilon - (v + \epsilon + c\epsilon) \\
&= \epsilon + \eta > \epsilon.
\end{aligned}$$

A.2 Proofs related to the convergence of SM-MCTS


In Section 3.2 we gave the proof of convergence of those SM-MCTS algorithms, which were
based on ε-UPO selection functions. It remains to prove Proposition 19 which establishes
a connection between regrets $R$ and $\tilde{R}$ of the selection function with respect to the observed
rewards and with respect to the exact subgame values. The goal is to show that if $R(T)$ is
small and algorithm A is ε-UPO, then the regret $\tilde{R}(T)$ is small as well.

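Proof [Proof of Proposition 19] Let A be an ε-HC algorithm with ε-UPO property. Recall
here Remark 33 - by $\tilde{R}(T)$ we denote the regret of action sequence i(t) chosen by A against
the adversary's action sequence j(t) in the matrix game $(v_{ij})$ - that is

$$\tilde{R}(T) = \max_{i^*} \sum_{t=1}^{T} v_{i^*j(t)} - \sum_{t=1}^{T} v_{i(t)j(t)} =: \max_{i^*} \tilde{S}_{i^*}(T),$$

and by $R(T)$ we denote the “observed” regret

$$R(T) = \max_{i^*} \sum_{t=1}^{T} s_{i^*j(t)}(t_{i^*j(t)}) - \sum_{t=1}^{T} s_{i(t)j(t)}(t_{i(t)j(t)}) =: \max_{i^*} S_{i^*}(T).$$

To prove the proposition it is sufficient to assume that $\limsup_{T\to\infty} R(T)/T \le \epsilon$ and show that
$\limsup_{T\to\infty} \tilde{R}(T)/T \le 2(c+1)\epsilon$ holds almost surely. Let i* be an action of player 1. Denote
by a) the fact that, by definition of $w_{ij}(n)$, $T_j = \sum_{m=1}^{T_{i^*j}} w_{i^*j}(m)$ holds for each T and j,
and by b) the fact that the t-th summand of the first sum equals $s_{i^*j}(m)$ if and only if
$j(t) = j$ and $t_{i^*j} = m$. We can rewrite $S_{i^*}(T)$ as follows:

$$\begin{aligned}
S_{i^*}(T) &= \sum_{t=1}^{T} s_{i^*j(t)}(t_{i^*j(t)}) - \sum_{t=1}^{T} s_{i(t)j(t)}(t_{i(t)j(t)}) \\
&\overset{b)}{=} \sum_j \sum_{m=1}^{T_{i^*j}} s_{i^*j}(m)\,\big|\{t \le T \mid t_{i^*j} = m \;\&\; j(t) = j\}\big| - \sum_{i,j}\sum_{m=1}^{T_{ij}} s_{ij}(m) \\
&= \sum_j \sum_{m=1}^{T_{i^*j}} s_{i^*j}(m)\, w_{i^*j}(m) - \sum_{i,j} T_{ij}\,\frac{1}{T_{ij}}\sum_{m=1}^{T_{ij}} s_{ij}(m) \\
&\overset{a)}{=} \sum_j T_j\,\frac{\sum_{m=1}^{T_{i^*j}} s_{i^*j}(m)\, w_{i^*j}(m)}{\sum_{m=1}^{T_{i^*j}} w_{i^*j}(m)} - \sum_{i,j} T_{ij}\,\tilde{s}_{ij}(T_{ij}) \\
&= \sum_j T_j\,\bar{s}_{i^*j}(T_{i^*j}) - \sum_{i,j} T_{ij}\,\tilde{s}_{ij}(T_{ij}) \\
&= \sum_j T_j\big(v_{i^*j} + \bar{s}_{i^*j}(T_{i^*j}) - v_{i^*j}\big) - \sum_{i,j} T_{ij}\big(v_{ij} + \tilde{s}_{ij}(T_{ij}) - v_{ij}\big) \\
&= \sum_{t=1}^{T} v_{i^*j(t)} - \sum_{t=1}^{T} v_{i(t)j(t)} + X_{i^*}(T) \\
&= \tilde{S}_{i^*}(T) + X_{i^*}(T),
\end{aligned}$$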
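where

$$X_{i^*}(T) := \sum_j T_j\big(\bar{s}_{i^*j}(T_{i^*j}) - v_{i^*j}\big) - \sum_{i,j} T_{ij}\big(\tilde{s}_{ij}(T_{ij}) - v_{ij}\big).$$

In particular, we can bound the regret $\tilde{R}(T)$ as

$$\tilde{R}(T) = \max_{i^*} \tilde{S}_{i^*}(T) = \max_{i^*}\big(S_{i^*}(T) - X_{i^*}(T)\big) \le R(T) + \max_{i^*} |X_{i^*}(T)|.$$

Clearly $X_{i^*}(T)$ satisfies

$$\frac{|X_{i^*}(T)|}{T} \le \sum_j \frac{T_j}{T}\,\big|\bar{s}_{i^*j}(T_{i^*j}) - v_{i^*j}\big| + \sum_{i,j}\frac{T_{ij}}{T}\,\big|\tilde{s}_{ij}(T_{ij}) - v_{ij}\big|.$$

Using the ε-UPO property and the assumption that $\limsup_{n\to\infty} |\tilde{s}_{ij}(n) - v_{ij}| \le c\epsilon$ holds a.s.
for each i,j, we get

$$\begin{aligned}
\limsup_{T\to\infty}\frac{|X_{i^*}(T)|}{T} &\le \limsup_{T\to\infty}\sum_j \frac{T_j}{T}\,\big|\bar{s}_{i^*j}(T_{i^*j}) - v_{i^*j}\big| + \limsup_{T\to\infty}\sum_{i,j}\frac{T_{ij}}{T}\,\big|\tilde{s}_{ij}(T_{ij}) - v_{ij}\big| \\
&\le (c+1)\,\epsilon\,\limsup_{T\to\infty}\sum_j \frac{T_j}{T} + c\,\epsilon\,\limsup_{T\to\infty}\sum_{i,j}\frac{T_{ij}}{T} = (2c+1)\,\epsilon.
\end{aligned}$$

Consequently, this implies that

$$\limsup_{T\to\infty} \tilde{R}(T)/T \le \limsup_{T\to\infty} R(T)/T + \limsup_{T\to\infty}\max_{i^*} |X_{i^*}(T)|/T \le \epsilon + (2c+1)\,\epsilon = 2(c+1)\,\epsilon$$

holds almost surely, which is what we wanted to prove.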


A.3 Details related to the counterexample for Theorem 15


In Section 5.2 (Lemma 30) we postulated the existence of algorithms, which behave similarly
to those from Example 3, but unlike those from Example 3, the new algorithms are ε-HC.
First, we define these algorithms and then we prove their properties in Lemma 36. Lemma
30, which was needed in Section 5.2, follows directly from Lemma 36.


Remark 34 In the following description of algorithm $A^J_1$, $ch$ denotes how many times the
other player cheated, while $\overline{ch}$ is the average ratio of cheating in following the cooperation
pattern. The variables $\widetilde{ch}$ and $\overline{\widetilde{ch}}$ then serve as the estimates of $ch$ and $\overline{ch}$. We present the
precise definitions below. The nodes I, J, actions X, Y, L, R, U and D and the respective
payoffs refer to the game G from Figure 6.

Definition of the algorithm $A^J_1$

Fix an increasing sequence of integers $b_n$ and repeat for $n \in \mathbb{N}$:

1. Buffer building: Play according to some ε-HC algorithm for $b_n$ iterations (continuing
where we left off in the (n − 1)-th buffer building phase).

2. Cooperation $C_n$: Repeat U, U, D, D for t = 1,2,... and expect the other player to
repeat L, R, R, L. At each iteration t, with probability ε, check whether the other
player is cooperating - that is play D instead of U resp. U instead of D, and if
the payoff does not correspond to the expected pattern the second player should be
following, set $\widetilde{ch}(t) = \frac{1}{\epsilon}$. If the other player passes this check, or if we did not perform
it, set $\widetilde{ch}(t) = 0$.

3. End of cooperation (might not happen): While executing step 2, denote $\overline{\widetilde{ch}}(t) :=
\frac{1}{t}\sum_{s=1}^{t} \widetilde{ch}(s)$. Once t satisfies


bn ~\~ t bn -\-1 

we check at each iteration whether the estimate $\widehat{\overline{ch}}(t)$ threatens to exceed $2\epsilon$ during the next iteration or not. If it does, we end the cooperation phase, set $n := n + 1$ and continue with the next buffer building phase.

4. Simulation of the other player: While repeating steps 1, 2 and 3, we simulate 
the other player’s algorithm (this is possible, since from the knowledge of our 
action and the received payoff, we can recover the adversary’s action). If it ends the 
cooperation phase and starts the next buffer building phase, we do the same. 

5. Unless the cooperation phase is terminated, we stay in phase Cn indefinitely. 

Definition of $A_2^J$: The algorithm $A_2^J$ is identical to $A_1^J$, except for the fact that it repeats the pattern L, R, R, L instead of U, U, D, D and expects the other player to repeat U, U, D, D.

Definition of $A^I$: The algorithm $A^I$ is a straightforward modification of $A_1^J$, with $2\epsilon$ in place of $\epsilon$: it repeats the sequence Y, X, X, Y and expects to receive payoffs 0, 0, 0, ... whenever playing X and payoffs 1, 0, 1, 0, ... when playing Y. However, whenever it deviates from the Y, X, X, Y pattern in order to check whether these expectations are met, it plays the same action once again (in order to avoid disturbing the payoff pattern for Y).

Remark 35 Steps 2 and 3 of the algorithm description are well defined, because the opponent's action choice can be recovered from the knowledge of our action choice and the resulting payoff. Regarding step 3, we note that the condition there is trivially satisfied for $t = 1, \dots, \epsilon b_n$, so the length $t_n$ of the cooperation phases $C_n$ tends to infinity as $n \to \infty$, regardless of the opponent's actions.

Lemma 36 (1) When facing each other, the average strategies of algorithms $A^I$, $A_1^J$ and $A_2^J$ will converge to the suboptimal strategy $\sigma^I = \sigma_1^J = \sigma_2^J = \left(\frac12, \frac12\right)$. However, the algorithms will suffer regret no higher than $C\epsilon$ for some $C > 0$ (where $C$ is independent of $\epsilon$).

(2) There exists a sequence $b_n$ (controlling the length of phases $B_n$) such that, when facing a different adversary, the algorithms suffer regret at most $C\epsilon$. Consequently, $A^I$, $A_1^J$ and $A_2^J$ are $C\epsilon$-Hannan consistent.


Proof Part (1): Note that, disregarding the checks made by the algorithms, the average strategies in cooperation phases converge to $\left(\frac12, \frac12\right)$. Furthermore, the probability of making the checks is the same at any iteration; therefore, if the algorithms eventually settle in the cooperative phase, the average strategies converge to $\left(\frac12, \frac12\right)$.

Step (i): The conclusion of (1) holds for $A_i^J$, $i = 1, 2$.

We claim that both algorithms $A_i^J$, $i = 1, 2$, will eventually settle in the cooperative mode, thus generating the payoff sequence 1, 0, 1, 0, ... during at least a $(1-\epsilon)^2$-fraction of the iterations and generating something else during the remaining fraction $1 - (1-\epsilon)^2 < 2\epsilon$ of the iterations. It is then immediate that the algorithms $A_i^J$ suffer regret at most $2\epsilon$ (since the resulting average strategy is a $2\epsilon$-equilibrium strategy).


Proof of (i): If the other player uses the same algorithm, we have $\mathbf{E}\,[\widehat{ch}(t)] = \epsilon$ and $\operatorname{Var}\,[\widehat{ch}(t)] \le \frac{1}{\epsilon} < \infty$, thus by the Strong Law of Large Numbers $\widehat{\overline{ch}}(t) \to \epsilon$ almost surely. In particular, there exists $t_0 \in \mathbb{N}$ such that

$$
\Pr\left[\forall t \ge t_0 : \widehat{\overline{ch}}(t) \le 2\epsilon\right] > 0 .
$$

In Remark 35 we observed that the cooperative phase always lasts at least $\epsilon b_n$ steps. Therefore, once we have $b_n \ge \frac{1}{\epsilon} t_0$, there is a non-zero probability that both players will stay in the $n$-th cooperative phase forever. If they do not, they have the same positive probability of staying in the next cooperative phase, and so on; by the Borel-Cantelli lemma, they will almost surely stay in $C_{n_0}$ for some $n_0 \in \mathbb{N}$.
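
The convergence used in the proof of (i) can be illustrated numerically. The following Python sketch is only an illustration and is not part of the original construction: the function name `simulate_cheating_estimate`, the seed and the iteration count are assumptions, and the opponent is modelled simply as deviating independently with probability $\epsilon$ at each iteration (as a player running the same algorithm would when performing its own checks). Each failed check contributes $1/\epsilon$ to the estimate, so the running average should approach $\epsilon$.

```python
import random


def simulate_cheating_estimate(eps, iterations, seed=0):
    """Running average of the per-iteration estimate (1/eps on a failed check, else 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        we_check = rng.random() < eps            # we test the opponent with probability eps
        opponent_deviates = rng.random() < eps   # opponent leaves its pattern with probability eps
        if we_check and opponent_deviates:
            total += 1.0 / eps                   # a failed check is weighted by 1/eps
    return total / iterations


if __name__ == "__main__":
    eps = 0.1
    print(simulate_cheating_estimate(eps, 200_000))
```

Under these assumptions the average stays well below the $2\epsilon$ threshold after sufficiently many iterations, which is the event whose positive probability the proof relies on.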


Step (ii): $A^I$ will eventually settle in the cooperative mode.

Proof of (ii): This statement can be proven by an argument similar to that of (i). The only difference is that instead of checking whether the other player is cheating, we check whether the payoff sequence is the one expected by $A^I$. The fact that $A^I$ settles in cooperative mode then immediately implies that $A^I$ will repeat the Y, X, X, Y pattern during approximately a $(1-4\epsilon)$-fraction of the iterations (we check with probability $2\epsilon$ and we always check twice). It is then also immediate that $A^I$ will receive the expected payoffs during at least a $(1-\epsilon)$-fraction of the iterations (the payoff is always 0, as expected, when X is played, and it is correct in at least a $(1-2\epsilon)$-fraction of cases when playing Y; X and Y are both played with the same frequency, which gives the result).


Step (iii): If $A_i^J$, $i = 1, 2$, and $A^I$ settle in cooperative mode, then $A^I$ suffers regret at most $C\epsilon$.

Proof of (iii): We denote by $a(t)$ the $t$-th action chosen by $A^I$ and by $(a^*(s))$ the 4-periodic sequence of actions starting with Y, X, X, Y. Moreover, we denote by $s_Y(n)$ the $n$-th payoff generated by $A_1^J$ and $A_2^J$ and by $x_Y(t)$ the number $s_Y(t_Y)$. Set also $x_X(t) = 0$ for each $t \in \mathbb{N}$. Finally, we denote by $(x_Y^*(s))$ the 4-periodic sequence of payoffs starting with 1, 0, 0, 0. Our goal is to show that the average regret

$$
r(t) = \frac{1}{t}\left(\max_{a \in \{X, Y\}} \sum_{s=1}^{t} x_a(s) - \sum_{s=1}^{t} x_{a(s)}(s)\right)
$$

is small. As we already observed in the earlier example, if neither the payoff pattern 1, 0, 1, 0, ... from node J, nor the action pattern Y, X, X, Y, ... at node I are disturbed, then we have $(a(s)) = (a^*(s))$ and $(x_Y(s)) = (x_Y^*(s))$, and therefore $A^I$ suffers no regret. Consequently, we have

$$
\begin{aligned}
r(t) &\le \frac{1}{t}\Bigl(t - \bigl|\{s \le t \mid a(s) = a^*(s) \ \&\ x_Y(s) = x_Y^*(s)\}\bigr|\Bigr) \\
&\le \frac{1}{t}\,\bigl|\{s \le t \mid a(s) \ne a^*(s)\} \cup \{s \le t \mid x_Y(s) \ne x_Y^*(s)\}\bigr| . \qquad (14)
\end{aligned}
$$

Firstly, recall that when $A^I$ deviates from its action pattern, it does so twice in a row. Since the sequence $s_Y$ is 2-periodic, any disturbance of $a(t)$ and $a(t+1)$ might change the payoffs $x_Y(t)$, $x_Y(t+1)$, but it will affect no others. By (ii), this change of action $a(t)$ concerns at most a $4\epsilon$-fraction of the iterations, and so we have

$$
\limsup_{t\to\infty} \frac{1}{t}\,\bigl|\{s \le t \mid a(s) \ne a^*(s)\}\bigr| \le 4\epsilon . \qquad (15)
$$

Secondly, when $s_Y(n)$ deviates from the expected pattern, this affects those iterations $t$ for which $t_Y = n$. This will typically be 2 iterations, unless $A^I$ was doing its checks. Those are done with probability $2\epsilon$; therefore with probability $2\epsilon$, 4 iterations are affected, with probability $(2\epsilon)^2$, 6 iterations are affected, and so on. Since we are interested in the limit behavior, we can assume that on average no more than $2\sum_{k=0}^{\infty} (2\epsilon)^k = \frac{2}{1-2\epsilon} \le 4$ iterations are affected by each disturbance to $s_Y$ (assuming, of course, that $\epsilon < \frac14$). By (i) we know that no more than a $2\epsilon$-fraction of the numbers $s_Y(s)$ will be changed. Consequently, no more than a $4 \cdot 2\epsilon$-fraction of the payoffs $x_Y(s)$ will be changed and we have

$$
\limsup_{t\to\infty} \frac{1}{t}\,\bigl|\{s \le t \mid x_Y(s) \ne x_Y^*(s)\}\bigr| \le 8\epsilon . \qquad (16)
$$

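Putting these two pieces of information together, we see that

$$
\limsup_{t\to\infty} r(t) \le \limsup_{t\to\infty} \frac{1}{t}\,\bigl|\{s \le t \mid a(s) \ne a^*(s)\}\bigr| + \limsup_{t\to\infty} \frac{1}{t}\,\bigl|\{s \le t \mid x_Y(s) \ne x_Y^*(s)\}\bigr| \le 4\epsilon + 8\epsilon = 12\epsilon ,
$$

which is what we wanted to prove.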
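
The arithmetic behind the constant 12 can be checked with a short computation. The sketch below is an illustration under the reading used in the argument above: each disturbance of the payoff sequence affects 2 iterations, plus 2 more for every further consecutive check, where the $k$-th extension occurs with probability $(2\epsilon)^k$. The helper names are hypothetical.

```python
def expected_affected_iterations(eps, terms=10_000):
    """Truncated series 2 * sum_{k>=0} (2*eps)^k, which tends to 2 / (1 - 2*eps)."""
    q = 2.0 * eps
    return 2.0 * sum(q ** k for k in range(terms))


def regret_bound(eps):
    """Fraction of disturbed actions (4*eps) plus fraction of disturbed payoffs."""
    action_part = 4.0 * eps
    # A 2*eps-fraction of the payoffs s_Y is disturbed, each affecting a bounded number of iterations.
    payoff_part = expected_affected_iterations(eps) * 2.0 * eps
    return action_part + payoff_part


if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.2):
        print(eps, expected_affected_iterations(eps), regret_bound(eps), 12 * eps)
```

For every $\epsilon \le \frac14$ the series stays at or below 4, so the computed bound never exceeds $12\epsilon$.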


Part (2): Algorithms $A_i^J$, $i = 1, 2$, are $C\epsilon$-HC.

a) Firstly, we assume that both players stick to their assigned patterns during at least a $(1-2\epsilon)$-fraction of iterations (in other words, assume that $\limsup \overline{ch}(t) \le 2\epsilon$). By the same argument as in (1), we can show that the algorithms will then suffer regret at most $C\epsilon$.

b) On the other hand, if $\limsup \overline{ch}(t) > 2\epsilon$, the algorithm will almost surely keep switching between $B_n$ and $C_n$ (a consequence of the Strong Law of Large Numbers). Denote by $r_{b,n}$ and $r_{c,n}$ the regret from phases $B_n$ and $C_n$, recall that $b_n$, $t_n$ are the lengths of these phases, and set

$$
r_n := \frac{b_n}{b_n + t_n}\, r_{b,n} + \frac{t_n}{b_n + t_n}\, r_{c,n} = \text{overall regret in } B_n \text{ and } C_n \text{ together.}
$$

Finally, let $r = \limsup r(t)$ denote the bound on the limit of regret of $A_1^J$. We need to prove that $r \le C\epsilon$. To do this, it is sufficient to show that $\limsup_n r_n$ is small; thus our goal will be to prove that if the sequence $b_n$ increases quickly enough, then $\limsup r_n \le C\epsilon$ holds almost surely. Denote by $(F_n)$ the formula

$$
\forall t \ge \epsilon b_n : \ \widehat{\overline{ch}}(t) \le 2\epsilon \implies \overline{ch}(t) \le 3\epsilon .
$$

We know that $\bigl|\widehat{\overline{ch}}(t) - \overline{ch}(t)\bigr| \to 0$ a.s., therefore we can choose $b_n$ such that

$$
\Pr\left[(F_n) \text{ holds}\right] \ge 1 - 2^{-n}
$$

holds. Since $\sum_n 2^{-n} < \infty$, the Borel-Cantelli lemma gives that $(F_n)$ will hold for all but finitely many $n \in \mathbb{N}$. Note that if $(F_n)$ holds for both players, then their empirical strategy is at most $3\epsilon$ away from the NE strategy, and thus $r_{c,n} \le C\epsilon$. Since $r_n$ is a convex combination of $r_{b,n}$ and $r_{c,n}$, and in $B_n$ we play $\epsilon$-consistently, we can compute

$$
r \le \limsup r_n \le \max\{\limsup r_{b,n},\ \limsup r_{c,n}\} \le \max\{\epsilon, C\epsilon\} = C\epsilon .
$$

The proof of $C\epsilon$-Hannan consistency of $A^I$ is analogous. ■


A.4 Proofs related to the finite time bound for SM-MCTS-A 

In order to get a bound on the finite time behavior of SM-MCTS-A, we need the following finite time analogy of Proposition 11.

Lemma 37 Let $\epsilon > 0$, $c \ge 0$, $\delta > 0$ be real numbers and let $(a_{ij}(t))$ be a repeated game with error $c\epsilon$ played by $\epsilon$-Hannan consistent players (using the same algorithm A). Denote by $t_1$ the time needed for algorithm A to have average regret $r(t)$ bounded by $\epsilon$ for every $t \ge t_1$ with probability at least $1 - \delta$, and by $t_2$ the time such that $|a_{ij}(t) - a_{ij}| \le c\epsilon$ for every $t \ge t_2$. Set $t_0 = \max\{t_1, \epsilon^{-1} t_2\}$. Then with probability at least $1 - 2\delta$ for each $t \ge t_0$ we have

$$
v - (c+2)\,\epsilon \le g(t) \le v + (c+2)\,\epsilon
$$

and no player can gain more than $2(c+2)\,\epsilon$ utility by deviating from strategy $\sigma(t)$ in the matrix game M.

Remark 38 The proof is a simple modification of the proof of Proposition 11. The difference between this lemma and Proposition 11 is the introduction of $\delta$ and $t_0$. The important part of the claim is the value of the constant $t_0$.


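Proof As in equation (12) of the mentioned proof, we get for $t \ge t_2$

$$
\left| g_{\max}(t) - \tilde g_{\max}(t) \right| \le \max_i \frac{1}{t} \sum_{s=1}^{t} \left| a_{ij(s)}(s) - a_{ij(s)} \right| \le 1 \cdot \frac{t_2}{t} + c\epsilon \cdot \frac{t - t_2}{t} .
$$

If we now take $t \ge t_0$, we have $t \ge \epsilon^{-1} t_2$, and thus

$$
\left| g_{\max}(t) - \tilde g_{\max}(t) \right| \le \frac{t_2}{\epsilon^{-1} t_2} + c\epsilon \cdot \frac{t}{t} = (c+1)\,\epsilon .
$$
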

The next step is the same as in the previous proof, except that we replace $c\epsilon$ by $(c+1)\,\epsilon$. This leads to the following analogy of the corresponding inequality from that proof, which holds with probability at least $1 - 2\delta$:

$$
\forall t \ge t_0 : \ v - (c+2)\,\epsilon \le g(t) \le v + (c+2)\,\epsilon .
$$

The remainder of the proof also proceeds as before, thus we only present the main steps: Assume that

$$
u(br, \sigma_2(t)) > v + 2(c+2)\,\epsilon + \eta \qquad (17)
$$

holds for some $t \ge t_0 \ge t_1$ and $\eta > 0$. We get the inequality

$$
g_{\max}(t) > v + (c+2)\,\epsilon + \epsilon + \eta ,
$$

which we combine with the inequality

$$
g(t) \le v + (c+2)\,\epsilon
$$

in order to get

$$
r(t) = g_{\max}(t) - g(t) > [v + (c+2)\,\epsilon + \epsilon + \eta] - [v + (c+2)\,\epsilon] = \epsilon + \eta > \epsilon . \qquad (18)
$$

Since both players are $\epsilon$-HC and $t \ge t_1$, we know that the inequality (18) (or its analogy for the second player) cannot hold with probability higher than $2\delta$, and thus neither can the inequality (17) (or its analogy for the second player). This concludes the proof. ■
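
The role of the constant $t_0$ can be checked numerically. The sketch below is only an illustration with hypothetical function names and arbitrary example values: it evaluates the bound $t_2/t + c\epsilon\,(t - t_2)/t$ on the average payoff discrepancy and confirms that it drops to at most $(c+1)\,\epsilon$ once $t \ge \max\{t_1, \epsilon^{-1} t_2\}$.

```python
def t0(t1, t2, eps):
    """Time threshold max{t1, t2 / eps} from the lemma."""
    return max(t1, t2 / eps)


def payoff_gap_bound(t, t2, c, eps):
    """Bound for t >= t2: error at most 1 in the first t2 rounds, at most c*eps afterwards."""
    return t2 / t + c * eps * (t - t2) / t


if __name__ == "__main__":
    t1, t2, c, eps = 100, 50, 2, 0.1           # arbitrary example values
    start = t0(t1, t2, eps)                     # equals 500.0 here
    print(start, payoff_gap_bound(start, t2, c, eps), (c + 1) * eps)
```

For these example values the bound at $t = t_0$ is $0.28$, below $(c+1)\,\epsilon = 0.3$, and it only decreases toward $c\epsilon$ for larger $t$.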


In the proof of Theorem 14, we will need to guarantee that each of the nodes in depth $d$ of the game tree gets visited at least $T$ times with high probability, for some $T \in \mathbb{N}$. Because of this, we include the following technical notation and the related Lemma 40. Firstly, we present the notation and the corresponding lemma, then we proceed to give the necessary details.


Notation 39 Let $\delta, \gamma > 0$ and $b \in \mathbb{N}$. For integers $T$ and $d$, we denote by $t(T, d)$ the smallest number such that if an algorithm with exploration rate $\gamma$ is used in a game with branching factor $b$ for $t(T, d)$ iterations, then with probability at least $1 - \delta$, each of the nodes in depth $d$ will be visited at least $T$ times.

Lemma 40 Let $\delta, \gamma > 0$ and $b \in \mathbb{N}$ and assume that $T \in \mathbb{N}$ satisfies $T \ge 4 \log \delta^{-1} + 4 \log 2$. Then we have

$$
t(T, d) \le 16 \left(\frac{1}{\gamma}\right)^{d-1} \log\left(2 b^{d-1}\right) \cdot b^{d-1}\, T .
$$


Remark 41 Consider a game with branching factor $b$ and depth $D$ and an algorithm A which explores with probability $\gamma > 0$. For the purposes of this section, we will say that the root is in depth $d = 1$, there are at most $b$ nodes in depth $d = 2$, $b^{d-1}$ nodes in depth $d$, $d \le D$, and the whole game tree $\mathcal{H}$ has at most $|\mathcal{H}| = 1 + b + \dots + b^{D-1}$ nodes. For simplicity we assume that the game tree is already built. At each iteration, there is probability $\gamma^{d-1}$ that the algorithm A will explore on every level of the game tree between the root and the $d$-th level. This means that with probability $\gamma^{d-1}$ one of the nodes on the $d$-th level will be chosen with respect to the uniform distribution over all of these $b^{d-1}$ nodes.

Let $n, t \in \mathbb{N}$. We denote by $U(n)$ the uniform distribution over the set $\{1, \dots, n\}$, let $X_s \sim U(n)$ for $s = 1, \dots, t$ be independent random variables with the same distribution as $U(n)$, and set $mU(n, t) = \min_{1 \le i \le n} \left|\{s \le t \mid X_s = i\}\right|$. Moreover, let $S \in \mathbb{N}$ and consider $S$ independent random choices whether to explore with probability $\gamma^{d-1}$ (or else play according to some other strategy with probability $1 - \gamma^{d-1}$). Out of these $S$ iterations, we will choose to explore $\tilde S$ many times. Using this notation, the number $t(T, d)$ satisfies

$$
t(T, d) = \min\left\{ S \in \mathbb{N} \ \middle|\ \Pr\left[mU(b^{d-1}, \tilde S) \ge T\right] \ge 1 - \delta \right\} .
$$


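Proof [Proof of Lemma 40] Denote $\bar T = 8 \log(2n)\, n T$. Firstly we show that $\Pr[mU(n, \bar T) \ge T] \ge 1 - \delta/2$, then we follow with the inequality $\Pr\left[\tilde S \ge \frac{\gamma^{d-1}}{2} S\right] \ge 1 - \delta/2$. Putting these two inequalities together with $n = b^{d-1}$, $S = 2 \left(\frac{1}{\gamma}\right)^{d-1} \bar T$ gives the result.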

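Step 1: For the first part, we find $u \in \mathbb{N}$ such that $p = \Pr[mU(n, u) \ge 1] \ge \frac12$. Using elementary combinatorics we get that $1 - p = \Pr[mU(n, u) = 0] \le n \cdot \left(\frac{n-1}{n}\right)^u$ and thus

$$
\begin{aligned}
p \ge \frac12 \ &\Longleftarrow\ n \left(\frac{n-1}{n}\right)^{u} \le \frac12 \\
&\iff \exp\left(u \log \frac{n-1}{n}\right) \le \frac{1}{2n} \\
&\iff u \log \frac{n-1}{n} \le \log \frac{1}{2n} \\
&\iff u \ge \log 2n \cdot \left[\log\left(1 + \frac{1}{n-1}\right)\right]^{-1} \\
&\Longleftarrow\ u \ge \log 2n \cdot \frac{1}{\frac12 \cdot \frac{1}{n-1}} \\
&\Longleftarrow\ u \ge 2n \log 2n .
\end{aligned}
$$

Dividing $\bar T$ into $4T$ blocks of length $2n \log 2n$ gives us $4T$ independent "trials" with success probability at least $\frac12$, where a success increases $mU$ by at least one in the given block. This means that $\Pr[mU(n, \bar T) \ge T] \ge \Pr[B(4T, \frac12) \ge T]$, where $B$ denotes the binomial distribution. The Chernoff bound for the binomial distribution gives us

$$
\Pr\left[B\left(4T, \tfrac12\right) \ge T\right] \ge 1 - \exp\left(-\frac{1}{2 \cdot \frac12} \cdot \frac{(2T - T)^2}{4T}\right) = 1 - \exp(-T/4) .
$$

Therefore, the conclusion holds for integers $T \in \mathbb{N}$ satisfying $\exp(-T/4) \le \delta/2$, which is equivalent to $T \ge 4 \log \delta^{-1} + 4 \log 2$.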
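
Both ingredients of Step 1 are easy to verify numerically. The following Python sketch is an illustration only (the function names are hypothetical): it evaluates the union bound $n\left(\frac{n-1}{n}\right)^u$ at a block length of $\lceil 2n \log 2n \rceil$ and compares the exact binomial tail $\Pr[B(4T, \frac12) \ge T]$ with the Chernoff estimate $1 - \exp(-T/4)$.

```python
import math


def miss_probability_bound(n, u):
    """Union bound on the probability that some value in {1, ..., n} is missed in u uniform draws."""
    return n * ((n - 1) / n) ** u


def block_length(n):
    """Block length 2 n log(2n), rounded up to an integer."""
    return math.ceil(2 * n * math.log(2 * n))


def binomial_tail_at_least(trials, p, k):
    """Exact probability Pr[B(trials, p) >= k]."""
    return sum(
        math.comb(trials, j) * p ** j * (1 - p) ** (trials - j)
        for j in range(k, trials + 1)
    )


if __name__ == "__main__":
    for n in (2, 5, 50):
        print(n, block_length(n), miss_probability_bound(n, block_length(n)))
    for T in (1, 5, 20):
        print(T, binomial_tail_at_least(4 * T, 0.5, T), 1 - math.exp(-T / 4))
```

In these examples the union bound is far below $\frac12$ and the exact tail is considerably larger than the Chernoff estimate, which suggests the constants in the lemma are not tight.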

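Step 2: For the second part, we know that $\tilde S \sim B(S, \gamma^{d-1})$ is a binomially distributed variable. Using the Chernoff bound, we have (writing $p = \gamma^{d-1}$)

$$
\Pr\left[\tilde S \ge \tfrac{p}{2} S\right] = 1 - \Pr\left[B(S, p) < \tfrac{p}{2} S\right] \ge 1 - \exp\left(-\frac{1}{2p} \cdot \frac{\left(pS - \frac{p}{2} S\right)^2}{S}\right) = 1 - \exp\left(-\frac{\gamma^{d-1}}{8}\, S\right) .
$$

Choosing $S \ge 2 \left(\frac{1}{\gamma}\right)^{d-1} \bar T$, we get both $\frac{p}{2} S \ge \bar T$ and

$$
1 - \exp\left(-\frac{\gamma^{d-1}}{8}\, S\right) \ge 1 - \exp\left(-\bar T/4\right) \ge 1 - \exp(-T/4) \ge 1 - \delta/2 .
$$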
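
The way the two steps combine into the bound of Lemma 40 can be traced with a small computation. The sketch below is only illustrative; the function names and the example parameters are assumptions. It computes the bound $16\,(1/\gamma)^{d-1} \log(2b^{d-1})\, b^{d-1}\, T$ and checks that it coincides with $S = 2\,(1/\gamma)^{d-1}\, \bar T$ for $\bar T = 8 \log(2n)\, nT$ and $n = b^{d-1}$.

```python
import math


def visits_time_bound(T, d, b, gamma):
    """Upper bound on t(T, d): 16 * (1/gamma)^(d-1) * log(2 b^(d-1)) * b^(d-1) * T."""
    n = b ** (d - 1)
    return 16 * (1 / gamma) ** (d - 1) * math.log(2 * n) * n * T


def min_visits(delta):
    """Smallest integer T with T >= 4 log(1/delta) + 4 log 2."""
    return math.ceil(4 * math.log(1 / delta) + 4 * math.log(2))


if __name__ == "__main__":
    b, d, gamma, delta = 3, 3, 0.5, 0.05       # arbitrary example values
    T = min_visits(delta)
    n = b ** (d - 1)
    t_bar = 8 * math.log(2 * n) * n * T         # number of exploring iterations needed (Step 1)
    S = 2 * (1 / gamma) ** (d - 1) * t_bar      # total iterations so that enough of them explore (Step 2)
    print(T, S, visits_time_bound(T, d, b, gamma))
```

With $\delta = 0.05$ the smallest admissible value is $T = 15$, and the two expressions for the iteration count agree.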


We are now ready to give the following proof:

Proof [Proof of Theorem 14] Fix $\delta > 0$ and let $T_A$ denote the time needed for A to have average regret smaller than $\epsilon$ with probability at least $1 - \delta$.

Set $T_1 = T_A$. For a fixed node $h$ in depth $D$, we have by Lemma 37 that the inequalities $v_h - 2\epsilon \le g_h(t) \le v_h + 2\epsilon$ (since $d_h = D$, we can represent the situation at $h$ as a repeated game with error $c\epsilon = 0$ and $t_2 = 0$) hold for $t \ge T_1 = \max\{T_1, \epsilon^{-1} \cdot 0\}$ with probability at least $1 - 2\delta$. However, we need these inequalities to hold for all nodes in depth $D$ at once. To guarantee that we visit all of them at least $T_1$ many times with probability at least $1 - \delta$, we need $t \ge T_2$, where $T_2 = t(\epsilon^{-1} T_1, D)$. Taking the product of all these probabilities, we get that all of the mentioned conditions hold at the same time with probability at least $(1 - 2\delta)^{b^{D-1}} (1 - \delta)$. We are done with the $D$-th level of the game tree.


Assume that everything above holds. Note that surely $T_2 \ge T_A$. By Lemma 37 ($t_0 = \max\{T_A, \epsilon^{-1} T_2\} = \epsilon^{-1} T_2$, $c = 2$) we know that if we visit a node $h$ in depth $D - 1$ at least $\epsilon^{-1} T_2$ many times, the inequalities $v_h - 4\epsilon \le g_h(t) \le v_h + 4\epsilon$ will hold with probability at least $1 - 2\delta$. Again we find a number $T_3$ that will guarantee that we visit each of these $b^{D-2}$ nodes at least $\epsilon^{-1} T_2$ many times with probability at least $1 - \delta$ (as in the paragraph above, $T_3 = t(\epsilon^{-1} T_2, D - 1)$). The probability of all conditions mentioned in this paragraph being satisfied at once will be $(1 - 2\delta)^{b^{D-2}} (1 - \delta)$.

We continue by induction and receive numbers $T_4, T_5, \dots, T_D$, which satisfy $T_{d+1} = t(\epsilon^{-1} T_d, D + 1 - d)$. We now calculate the exact numbers for probability, equilibrium distance and time bounds:

Taking the product of the probabilities $(1 - 2\delta)^{b^{d-1}} (1 - \delta)$ for $d = 1, \dots, D$ gives us

$$
(1 - 2\delta)^{\left(b^{D-1} + \dots + b^2 + b + 1\right)} (1 - \delta)^{D} = (1 - 2\delta)^{|\mathcal{H}|} (1 - \delta)^{D} \ge 1 - \left(2 |\mathcal{H}| + D\right) \delta .
$$


D 


On the $(D + 1 - d)$-th level of the tree, we have $v_h - (c_d + 2)\,\epsilon \le g_h(t) \le v_h + (c_d + 2)\,\epsilon$, where $c_d = 2d - 2$ (easily checked by induction, as $c_{d+1} = c_d + 2$ and $c_1 = 0$). Lemma 37 implies that no player can gain more than $2 (c_d + 2)\,\epsilon = 4d\epsilon$ by changing his action on the given level only. Note that these possible deviations sum to

$$
\sum_{d=1}^{D} 4d\epsilon = 2D(D + 1)\,\epsilon . \qquad (19)
$$


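Finally, we calculate the value of $T_D$ by substituting from Lemma 40:

$$
\begin{aligned}
T_D &= T_A \cdot \prod_{d=2}^{D} 16\, \epsilon^{-1} \left(\frac{1}{\gamma}\right)^{d-1} \log\left(2 b^{d-1}\right) \cdot b^{d-1} \\
&\le T_A \cdot \left(16 \epsilon^{-1}\right)^{D-1} \left(\frac{b}{\gamma}\right)^{\frac{D(D-1)}{2}} \log^{D-1}\left(2b + \dots + 2b^{D-1}\right) \\
&= T_A \cdot \left(16 \epsilon^{-1}\right)^{D-1} \left(\frac{b}{\gamma}\right)^{\frac{D(D-1)}{2}} \log^{D-1}\left(2 |\mathcal{H}| - 2\right) .
\end{aligned}
$$
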

With probability at least $1 - (2 |\mathcal{H}| + D)\,\delta$ we get (by (19)) that for every $t \ge T_D$ no player can gain more than $2D(D + 1)\,\epsilon$ utility by choosing a different action in any (or all at once) node of the tree. Multiplying this result by two, we get that the empirical frequencies will form a $4D(D + 1)\,\epsilon$-equilibrium for every $t \ge T_D$ with probability at least $1 - (2 |\mathcal{H}| + D)\,\delta$; the proof is finished.
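
The three quantities produced by this proof (time bound, success probability and equilibrium distance) can be computed for concrete parameters. The following Python sketch is an illustration under the assumption that the time bound is obtained by multiplying $T_A$ by one factor $16\,\epsilon^{-1} (1/\gamma)^{d-1} \log(2b^{d-1})\, b^{d-1}$ for each depth $d = 2, \dots, D$; the function name and the example values are hypothetical.

```python
import math


def theorem14_quantities(T_A, D, b, gamma, eps, delta):
    """Return (time bound, probability lower bound, equilibrium distance) for the given parameters."""
    tree_size = sum(b ** k for k in range(D))           # |H| = 1 + b + ... + b^(D-1)
    time_bound = float(T_A)
    for d in range(2, D + 1):                            # one factor per depth d = 2, ..., D
        n = b ** (d - 1)
        time_bound *= 16 / eps * (1 / gamma) ** (d - 1) * math.log(2 * n) * n
    probability = 1 - (2 * tree_size + D) * delta        # from the product of per-level probabilities
    distance = 4 * D * (D + 1) * eps                     # twice the sum of 4*d*eps over d = 1, ..., D
    return time_bound, probability, distance


if __name__ == "__main__":
    print(theorem14_quantities(T_A=100, D=3, b=2, gamma=0.5, eps=0.01, delta=0.001))
```

Even for this very small tree the time bound grows by a factor of roughly $10^8$ over $T_A$, which reflects the $(16\epsilon^{-1})^{D-1}$ and $(b/\gamma)^{D(D-1)/2}$ terms.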

Note that in the proof above, we assumed that the SM-MCTS-A algorithm starts when the game tree is already built. If we wanted to start with an empty game tree, we would have to increase the time bound by an additional constant, such that with high probability, we first visit all $b$ nodes one level below the root at least once, then we visit all $b^2$ nodes in the next level, and so on. By Lemma 40 this constant would be equal to $t(1, 1) + \dots + t(1, D)$, a number which is negligible when compared to other parts of the finite time bound.